Important Note

This script will not run without the actual ADNI data downloaded from ADNI. Please run “DA5030.Proj.Bryant.Simulated_ADNI.Rmd” to demonstrate code functionality. This has been submitted to show the exact code I used to generate my GitHub pages website using the actual ADNI data.

This script and corresponding output use real data downloaded from the Alzheimer’s Disease Neuroimaging Initiative. Part of the data agreement terms for ADNI users is that data not be published externally, so I cannot directly share the data upon which I ran my own analysis. However, I have simulated the data using the identical data structure to that downloaded from ADNI and published these simulated datasets as CSVs at [my GitHub repo](https://github.com/anniegbryant/DA5030_Final_Project). Any reader who wishes to access the actual dataset used for my analysis should register for an ADNI account (free) and refer to the specific data files described in Data Understanding.

Additionally, this project has been published as a GitHub pages website containing figures and analysis created with the actual ADNI dataset, and can be found here: https://anniegbryant.github.io/DA5030_Final_Project/

Lastly, the Shiny app deployed based on this project can be found here: https://annie-bryant.shinyapps.io/TauPET_Shiny_App_Notebook/


Phase One: Business Understanding

Determining Business Objectives

For my final project in DA5030 “Data Mining and Machine Learning”, my objective is to leverage neuroimaging-based data to predict cognitive decline in subjects along the cognitive spectrum from cognitively unimpaired to severe dementia. The goal is to identify specific brain regions that, when burdened by Alzheimer’s Disease-related pathology, confer predictive power for cognitive status as measured via neuropsychological assessment. Ideally, I would like to identify the regions of interest (ROIs) in the brain that change the most with decreasing cognitive ability and to refine a set of ROIs that collectively predict changes in cognitive assessment scores. This will be (tentatively) regarded as a success if one or more ROIs can explain more than 50% of the variance in cognitive assessment scores (i.e. R\(^2\) > 0.5).

Assessing the Situation

I will focus on one specific form of neuroimaging: Positron Emission Tomography (PET). PET imaging enables the visualization of specific molecular substrates in the brain through the use of radioactively-labeled tracers that bind the target substrate. In this case, I have chosen to focus on PET that binds to the protein tau, which exhibits characteristic misfolding in Alzheimer’s Disease (AD). Misfolded tau not only loses its normal function, but it also aggregates into intracellular neurofibrillary tangles (NFTs) that can disrupt neuronal signaling and promote neurodegeneration. This phenomenon typically follows an archetypical spreading pattern beginning in the entorhinal cortex, progressing out to the hippocampus and amygdala, and then spreading out beyond the medial temporal lobe to the limbic system and onto the neocortex. This staging pattern is well-defined following the seminal paper published by Braak & Braak in 1991; the stages of tau NFT pathology progression are now known as the Braak stages. There are six stages of tau NFT progression in total.
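The spreading pattern described above can be sketched as a simple lookup. This is my own illustrative simplification of the paired Braak stage groupings, following the progression named in the text, and not a formal anatomical definition:

```r
# Illustrative grouping of the six Braak stages into the commonly-used pairs
# (a simplification for orientation, not an exhaustive anatomical definition)
braak_stages <- list(
  "I-II"   = "entorhinal cortex (medial temporal lobe)",
  "III-IV" = "hippocampus, amygdala, and surrounding limbic regions",
  "V-VI"   = "association and primary neocortex"
)

braak_stages[["I-II"]]
# -> "entorhinal cortex (medial temporal lobe)"
```

These same paired groupings appear later in the tau-PET dataset as aggregate ROI columns.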

Such staging has traditionally only been possible at autopsy, as it requires careful immunohistochemical staining of several brain regions by an experienced neuropathologist. However, recent years have seen the development of tau-PET tracers that are specific to misfolded NFT tau. One tracer in particular, 18F-AV-1451, has become widely used in the last few years as a non-invasive biomarker to measure regional accumulation of tau in the human brain. Tau-PET uptake correlates well with the typical postmortem Braak staging patterns (Schwarz et al. 2016) as well as cognitive status (Zhao et al. 2019). Recent studies have utilized machine learning algorithms with tau-PET neuroimaging, as well as other (relatively) non-invasive biomarkers including amyloid-beta PET and cerebrospinal fluid (CSF) protein measurements, to collectively predict onset of dementia (Mishra et al. 2017) or to predict the spread of tau NFT pathology in the brain (Vogel et al. 2019, 2020). However, longitudinal analysis of tau-PET accumulation and its relationship to cognition remains relatively unexplored, largely owing to the recency of tau-PET tracer development.

Resource Inventory

Through my role as a research assistant at the MassGeneral Institute for Neurodegenerative Disease, I have previously worked with the Alzheimer’s Disease Neuroimaging Initiative (ADNI) data repository. ADNI is a tremendous resource for imaging-based and molecular biomarker data acquired from thousands of research participants across the country (see Acknowledgments for more information). In 2016, ADNI incorporated 18F-AV-1451 tau-PET neuroimaging into its imaging protocol and has since amassed well over a thousand tau-PET scans. Researchers at UC Berkeley have processed many of these images and quantified regional uptake of the tau-PET tracer, and have generously shared their regional tau-PET data for ADNI collaborators to access. ADNI has also compiled cognitive assessment scores for each subject. I will utilize these two resources to develop individual regression models as well as an ensemble model to predict cognitive decline as a function of pathological tau NFT accumulation throughout the brain.

Requirements, Assumptions, Constraints

The only constraint is that I cannot directly share the full dataset as downloaded from ADNI, though I encourage anyone interested in gaining access to register for free at http://adni.loni.usc.edu/. Instead, I wrote a custom encryption function to simulate fake subject identifier IDs, exam dates, tau-PET uptake values, region of interest volumes, age, sex, and CDR-Sum of Boxes scores based on the existing data distribution. These simulated datasets are hosted in my GitHub repo. Please note that the actual ADNI data are used in this file.

Determine Data Mining Goals

My goal in this analysis is to develop a model that can predict change in cognitive status through some combination (linear or nonlinear) of multiple brain regions, each of which exhibit a different change in tau-PET uptake. In doing this, I also hope to identify which region(s) of the brain are most prone to accumulation of tau NFT pathology as measured via PET, and in turn, which region(s) can best predict cognitive decline.

Data Mining Success Criteria

The target feature in this project will be a continuous score on a cognitive assessment (CDR Sum of Boxes – see Data Understanding). Therefore, models will be evaluated based on their root mean squared error (RMSE) and the R\(^2\) between predicted and actual cognitive scores. I have set a benchmark of success at R\(^2\) > 0.5, meaning the model explains at least 50% of the variance seen in cognitive score changes. This is an ambitious threshold, as cognitive status is multifactorial and certainly modulated by more than regional tau accumulation, but this figure will distinguish stronger from weaker predictive models.
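These two criteria can be computed with a small helper function. This is a sketch with hypothetical values (`evaluate_model` is my own name; caret’s built-in metrics could equally be used):

```r
# RMSE and R^2 (squared Pearson correlation) between actual and predicted scores
evaluate_model <- function(actual, predicted) {
  rmse <- sqrt(mean((actual - predicted)^2))
  r2   <- cor(actual, predicted)^2
  c(RMSE = rmse, R2 = r2)
}

# Hypothetical predicted vs. actual CDR Sum of Boxes scores:
evaluate_model(actual = c(0, 0.5, 2, 4.5), predicted = c(0.5, 0.5, 2.5, 4))
```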

Phase Two: Data Understanding

All packages used in this file:

# General data wrangling
library(tidyverse)
library(knitr)
library(kableExtra)
library(DT)
library(lubridate)
library(readxl)
library(fakeR)

# Modeling
library(factoextra)
library(FactoMineR)
library(glmnet)
library(caret)
library(ranger)
library(caretEnsemble)
library(Hmisc)

# Visualization
library(plotly)
library(forcats)
library(ggsignif)
library(ggcorrplot)
library(psych)
library(GGally)
library(gridExtra)
library(colorRamps)
library(RColorBrewer)
library(colorspace)
library(NeuralNetTools)
library(ggplotify)
library(igraph)

# ggseg is used to visualize the brain
# remotes::install_github("LCBC-UiO/ggseg")
# If that doesn't work: 
# download.file("https://github.com/LCBC-UiO/ggseg/archive/master.zip", "ggseg.zip")
# unzip("ggseg.zip")
# devtools::install_local("ggseg-master")
library(ggseg)

# remotes::install_github("LCBC-UiO/ggseg3d")
library(ggseg3d)

# remotes::install_github("LCBC-UiO/ggsegExtra")
library(ggsegExtra)

Tau-PET Data

Data overview

The longitudinal tau-PET dataset was downloaded as a CSV from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) Study Data repository, located at Study Data/Imaging/PET Image Analysis/UC Berkeley - AV1451 Analysis [ADNI2,3] (version: 5/12/2020). This CSV file contains 1,121 rows and 165 columns. Note: ADNI data is freely accessible to all registered users. Please see my Acknowledgments page for more information about ADNI and its contributors.

Data loading

On my end, I load partial volume corrected regional tau-PET data, as downloaded from ADNI:

tau.df <- read.csv("../ADNI_Data/Raw_Data/UCBERKELEYAV1451_PVC_05_12_20.csv")
tau.df$EXAMDATE <- as.Date(tau.df$EXAMDATE, format="%m/%d/%Y")

# update stamp is irrelevant, drop it
tau.df <- select(tau.df, -update_stamp)

However, since I can’t share the tau-PET data directly from ADNI, I’ve “simulated” this dataset using a custom encryption function (sourced locally from encrypt_df.R, not on GitHub) that modifies the columns as follows:

  • RID: adds a constant integer to the subject ID; this way, the same subject keeps the same ID, but they are untraceable to ADNI’s data
  • EXAMDATE: adds a random number of days between -15 and 15 to each visit
  • SUVR: adds a random number from uniform distribution [-0.5,0.5] to the tau-PET uptake value
  • VOLUME: adds a random integer to the volume ranging from -300 to 300

This way, the data form remains approximately similar to that which I downloaded from ADNI, without sharing any traceable information to one specific individual.
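encrypt_df.R itself is sourced locally and not shared; a minimal sketch of the jitter scheme described in the bullets above (my own illustrative reimplementation, not the actual function) might look like:

```r
# Illustrative sketch only -- NOT the actual encrypt_df.R, which is local
jitter_pet <- function(df, id_offset) {
  n <- nrow(df)
  df$RID <- df$RID + id_offset  # constant offset: same subject keeps same ID
  df$EXAMDATE <- df$EXAMDATE + sample(-15:15, n, replace = TRUE)
  suvr_cols <- grep("_SUVR$", names(df))
  vol_cols  <- grep("_VOLUME$", names(df))
  df[suvr_cols] <- df[suvr_cols] +
    matrix(runif(n * length(suvr_cols), -0.5, 0.5), nrow = n)
  df[vol_cols] <- df[vol_cols] +
    matrix(sample(-300:300, n * length(vol_cols), replace = TRUE), nrow = n)
  df
}
```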

# NOT RUN; just here to show how I simulated the other dataset
source("encrypt_df.R")
tau.df <- encrypt_pet(tau.df)
tau.df$RID <- round(tau.df$RID)
write.csv(tau.df, "Simulated_ADNI_TauPET.csv", row.names = F)

Each row in the CSV represents one tau-PET scan (see the str call below). Some subjects had repeated scans separated by approximately one year, while other subjects had only one scan. Columns include subject information such as anonymized subject ID, visit code, and PET exam date. The remaining columns encode regional volume and tau-PET uptake. Specifically, there are 80 distinct cortical and subcortical regions of interest (ROIs), each of which has a volume field (in mm^3) and a tau-PET uptake field, called the Standardized Uptake Value Ratio (SUVR).
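A quick sanity check on that wide layout: every ROI should contribute exactly one _SUVR and one _VOLUME column. A small helper can verify the pairing (`check_roi_pairs` is my own illustrative name, not part of the analysis):

```r
# Verify that each ROI has a matched _SUVR / _VOLUME column pair
check_roi_pairs <- function(df) {
  suvr <- sub("_SUVR$",   "", grep("_SUVR$",   names(df), value = TRUE))
  vol  <- sub("_VOLUME$", "", grep("_VOLUME$", names(df), value = TRUE))
  list(n_rois   = length(suvr),
       unpaired = union(setdiff(suvr, vol), setdiff(vol, suvr)))
}

# e.g., check_roi_pairs(tau.df)$unpaired should come back empty (character(0))
```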

str(tau.df)
## 'data.frame':    1120 obs. of  164 variables:
##  $ RID                                   : int  21 31 31 56 56 56 59 69 69 69 ...
##  $ VISCODE                               : chr  "init" "init" "y1" "init" ...
##  $ VISCODE2                              : chr  "m144" "m144" "m156" "m144" ...
##  $ EXAMDATE                              : Date, format: "2018-02-02" "2018-04-24" ...
##  $ INFERIOR_CEREBGM_SUVR                 : num  1.32 1.33 1.33 1.28 1.24 ...
##  $ INFERIOR_CEREBGM_VOLUME               : int  52306 54296 54296 56750 56750 56750 59836 56862 56862 56862 ...
##  $ HEMIWM_SUVR                           : num  1.02 0.85 0.866 1.138 1.196 ...
##  $ HEMIWM_VOLUME                         : int  321220 281690 281690 336495 336495 336495 294422 463900 463900 463900 ...
##  $ BRAAK12_SUVR                          : num  2.06 2.24 2.3 1.91 1.88 ...
##  $ BRAAK12_VOLUME                        : int  10275 7587 7587 9376 9376 9376 10379 10981 10981 10981 ...
##  $ BRAAK34_SUVR                          : num  1.95 1.87 1.8 1.82 1.77 ...
##  $ BRAAK34_VOLUME                        : int  95661 95419 95419 92482 92482 92482 94092 112788 112788 112788 ...
##  $ BRAAK56_SUVR                          : num  1.99 1.92 1.84 1.87 1.84 ...
##  $ BRAAK56_VOLUME                        : int  284821 288136 288136 283119 283119 283119 283727 325054 325054 325054 ...
##  $ BRAIN_STEM_SUVR                       : num  1.27 1.12 1.12 1.2 1.17 ...
##  $ BRAIN_STEM_VOLUME                     : int  16955 16952 16952 20508 20508 20492 18057 18872 18872 18866 ...
##  $ LEFT_MIDDLEFR_SUVR                    : num  2.02 1.93 1.8 1.83 1.78 ...
##  $ LEFT_MIDDLEFR_VOLUME                  : int  17640 18517 18517 17164 17164 17164 17683 21907 21907 21907 ...
##  $ LEFT_ORBITOFR_SUVR                    : num  2.17 2.03 1.92 2.11 1.98 ...
##  $ LEFT_ORBITOFR_VOLUME                  : int  11676 10091 10091 11721 11721 11721 10917 12109 12109 12109 ...
##  $ LEFT_PARSFR_SUVR                      : num  2.02 2.01 1.98 2.03 1.99 ...
##  $ LEFT_PARSFR_VOLUME                    : int  9201 7799 7799 9185 9185 9185 7709 9813 9813 9813 ...
##  $ LEFT_ACCUMBENS_AREA_SUVR              : num  1.14 1.04 1.79 1.12 1.18 ...
##  $ LEFT_ACCUMBENS_AREA_VOLUME            : int  500 318 318 308 308 308 353 361 361 361 ...
##  $ LEFT_AMYGDALA_SUVR                    : num  1.31 1.54 1.63 1.42 1.37 ...
##  $ LEFT_AMYGDALA_VOLUME                  : int  1367 1224 1224 1561 1561 1561 993 1499 1499 1499 ...
##  $ LEFT_CAUDATE_SUVR                     : num  2.08 1.46 1.34 1.95 1.83 ...
##  $ LEFT_CAUDATE_VOLUME                   : int  3016 4890 4890 3083 3083 3083 2874 4049 4049 4049 ...
##  $ LEFT_HIPPOCAMPUS_SUVR                 : num  2.12 1.96 2.2 1.69 1.73 ...
##  $ LEFT_HIPPOCAMPUS_VOLUME               : int  3822 3050 3050 3476 3476 3476 3603 3550 3550 3550 ...
##  $ LEFT_PALLIDUM_SUVR                    : num  3.79 1.89 1.95 2.5 2.6 ...
##  $ LEFT_PALLIDUM_VOLUME                  : int  444 2066 2066 1301 1301 1301 1081 1634 1634 1634 ...
##  $ LEFT_PUTAMEN_SUVR                     : num  1.69 1.64 1.42 1.9 1.78 ...
##  $ LEFT_PUTAMEN_VOLUME                   : int  4000 5675 5675 4832 4832 4832 3563 4891 4891 4891 ...
##  $ LEFT_THALAMUS_PROPER_SUVR             : num  1.45 1.32 1.24 1.54 1.53 ...
##  $ LEFT_THALAMUS_PROPER_VOLUME           : int  8226 6195 6195 7114 7114 7114 7561 7518 7518 7518 ...
##  $ RIGHT_MIDDLEFR_SUVR                   : num  2.08 1.91 1.8 1.94 1.85 ...
##  $ RIGHT_MIDDLEFR_VOLUME                 : int  17250 18440 18440 15605 15605 15605 16280 22586 22586 22586 ...
##  $ RIGHT_ORBITOFR_SUVR                   : num  2.19 2.01 1.86 2.17 2.03 ...
##  $ RIGHT_ORBITOFR_VOLUME                 : int  11614 12637 12637 11064 11064 11064 11537 12575 12575 12575 ...
##  $ RIGHT_PARSFR_SUVR                     : num  2.17 2.08 1.9 2.09 2.01 ...
##  $ RIGHT_PARSFR_VOLUME                   : int  9255 8131 8131 9641 9641 9641 8839 9119 9119 9119 ...
##  $ RIGHT_ACCUMBENS_AREA_SUVR             : num  1.41 1.65 1.66 1.01 1.07 ...
##  $ RIGHT_ACCUMBENS_AREA_VOLUME           : int  545 413 413 423 423 423 542 528 528 528 ...
##  $ RIGHT_AMYGDALA_SUVR                   : num  1.18 1.79 1.89 1.37 1.44 ...
##  $ RIGHT_AMYGDALA_VOLUME                 : int  1268 1028 1028 1464 1464 1464 1313 1797 1797 1797 ...
##  $ RIGHT_CAUDATE_SUVR                    : num  2.01 1.57 1.37 1.96 1.89 ...
##  $ RIGHT_CAUDATE_VOLUME                  : int  3179 4854 4854 2984 2984 2984 3224 3835 3835 3835 ...
##  $ RIGHT_HIPPOCAMPUS_SUVR                : num  2.01 2.09 2.03 1.62 1.64 ...
##  $ RIGHT_HIPPOCAMPUS_VOLUME              : int  3978 2723 2723 3489 3489 3489 3667 3942 3942 3942 ...
##  $ RIGHT_PALLIDUM_SUVR                   : num  3.01 2.32 2.12 2.33 2.48 ...
##  $ RIGHT_PALLIDUM_VOLUME                 : int  846 1531 1531 1262 1262 1262 1088 1552 1552 1552 ...
##  $ RIGHT_PUTAMEN_SUVR                    : num  1.68 1.62 1.53 2.06 1.94 ...
##  $ RIGHT_PUTAMEN_VOLUME                  : int  4322 5774 5774 4328 4328 4328 3190 4569 4569 4569 ...
##  $ RIGHT_THALAMUS_PROPER_SUVR            : num  1.42 1.33 1.24 1.52 1.55 ...
##  $ RIGHT_THALAMUS_PROPER_VOLUME          : int  5968 5442 5442 5940 5940 5940 6257 7899 7899 7899 ...
##  $ CHOROID_SUVR                          : num  7.45 4.56 4.31 3.84 3.79 ...
##  $ CHOROID_VOLUME                        : int  4180 3591 3591 3165 3165 3165 3717 3663 3663 3663 ...
##  $ CTX_LH_BANKSSTS_SUVR                  : num  1.75 1.49 1.6 1.7 1.63 ...
##  $ CTX_LH_BANKSSTS_VOLUME                : int  1553 1633 1633 1812 1812 1812 1694 2601 2601 2601 ...
##  $ CTX_LH_CAUDALANTERIORCINGULATE_SUVR   : num  1.67 1.73 1.65 1.69 1.69 ...
##  $ CTX_LH_CAUDALANTERIORCINGULATE_VOLUME : int  1138 1387 1387 1124 1124 1124 1465 1512 1512 1512 ...
##  $ CTX_LH_CUNEUS_SUVR                    : num  2.33 2.2 2.05 2.01 2 ...
##  $ CTX_LH_CUNEUS_VOLUME                  : int  2023 2702 2702 2429 2429 2429 2393 2222 2222 2222 ...
##  $ CTX_LH_ENTORHINAL_SUVR                : num  2.07 2.3 2.43 2.79 2.52 ...
##  $ CTX_LH_ENTORHINAL_VOLUME              : int  1468 1035 1035 1068 1068 1068 1297 1888 1888 1888 ...
##  $ CTX_LH_FUSIFORM_SUVR                  : num  1.97 1.87 1.83 1.84 1.77 ...
##  $ CTX_LH_FUSIFORM_VOLUME                : int  7956 6997 6997 7694 7694 7694 7807 9083 9083 9083 ...
##  $ CTX_LH_INFERIORPARIETAL_SUVR          : num  1.99 1.95 1.94 1.85 1.89 ...
##  $ CTX_LH_INFERIORPARIETAL_VOLUME        : int  11656 10174 10174 9243 9243 9243 8180 9846 9846 9846 ...
##  $ CTX_LH_INFERIORTEMPORAL_SUVR          : num  2.16 1.97 2.05 2.1 2.02 ...
##  $ CTX_LH_INFERIORTEMPORAL_VOLUME        : int  6606 6418 6418 7286 7286 7286 6869 9599 9599 9599 ...
##  $ CTX_LH_INSULA_SUVR                    : num  1.51 1.64 1.65 1.51 1.48 ...
##  $ CTX_LH_INSULA_VOLUME                  : int  6711 4654 4654 6003 6003 6003 5513 6597 6597 6597 ...
##  $ CTX_LH_ISTHMUSCINGULATE_SUVR          : num  1.9 1.81 1.82 1.79 1.94 ...
##  $ CTX_LH_ISTHMUSCINGULATE_VOLUME        : int  2283 2215 2215 1549 1549 1549 1944 2264 2264 2264 ...
##  $ CTX_LH_LATERALOCCIPITAL_SUVR          : num  2.39 2.06 1.99 1.92 2 ...
##  $ CTX_LH_LATERALOCCIPITAL_VOLUME        : int  8532 10148 10148 8292 8292 8292 10612 9404 9404 9404 ...
##  $ CTX_LH_LINGUAL_SUVR                   : num  2.27 1.95 1.97 1.76 1.74 ...
##  $ CTX_LH_LINGUAL_VOLUME                 : int  4329 4658 4658 5606 5606 5606 5435 6488 6488 6488 ...
##  $ CTX_LH_MIDDLETEMPORAL_SUVR            : num  2.2 2.06 1.89 2.04 1.99 ...
##  $ CTX_LH_MIDDLETEMPORAL_VOLUME          : int  7445 8322 8322 7292 7292 7292 8031 9467 9467 9467 ...
##  $ CTX_LH_PARACENTRAL_SUVR               : num  1.99 1.79 1.8 1.91 1.8 ...
##  $ CTX_LH_PARACENTRAL_VOLUME             : int  2672 2890 2890 3231 3231 3231 3358 3173 3173 3173 ...
##  $ CTX_LH_PARAHIPPOCAMPAL_SUVR           : num  1.6 1.86 1.92 1.72 1.66 ...
##  $ CTX_LH_PARAHIPPOCAMPAL_VOLUME         : int  1659 1549 1549 1900 1900 1900 1989 2296 2296 2296 ...
##  $ CTX_LH_PERICALCARINE_SUVR             : num  2.23 1.45 1.41 1.56 1.54 ...
##  $ CTX_LH_PERICALCARINE_VOLUME           : int  1678 2004 2004 1866 1866 1866 1918 1927 1927 1927 ...
##  $ CTX_LH_POSTCENTRAL_SUVR               : num  2.03 1.81 1.82 1.85 1.78 ...
##  $ CTX_LH_POSTCENTRAL_VOLUME             : int  8281 8428 8428 8275 8275 8275 7580 8976 8976 8976 ...
##  $ CTX_LH_POSTERIORCINGULATE_SUVR        : num  1.82 1.89 1.84 1.72 1.67 ...
##  $ CTX_LH_POSTERIORCINGULATE_VOLUME      : int  2439 2608 2608 2683 2683 2683 2573 2638 2638 2638 ...
##  $ CTX_LH_PRECENTRAL_SUVR                : num  1.91 1.85 1.75 1.62 1.61 ...
##  $ CTX_LH_PRECENTRAL_VOLUME              : int  11174 12349 12349 10924 10924 10924 10820 12307 12307 12307 ...
##  $ CTX_LH_PRECUNEUS_SUVR                 : num  1.93 1.89 1.94 1.81 1.81 ...
##  $ CTX_LH_PRECUNEUS_VOLUME               : int  7870 8313 8313 8387 8387 8387 8311 8584 8584 8584 ...
##  $ CTX_LH_ROSTRALANTERIORCINGULATE_SUVR  : num  1.71 1.58 1.49 1.59 1.48 ...
##  $ CTX_LH_ROSTRALANTERIORCINGULATE_VOLUME: int  2928 2448 2448 1695 1695 1695 2466 2915 2915 2915 ...
##  $ CTX_LH_SUPERIORFRONTAL_SUVR           : num  1.86 1.86 1.74 1.84 1.77 ...
##   [list output truncated]

The SUVR value is normalized to the tau-PET uptake in the inferior cerebellum gray matter (highlighted in blue below), a commonly-used region for tau normalization given the lack of inferior cerebellar tau pathology in Alzheimer’s Disease.

aseg_3d %>% 
  unnest(ggseg_3d) %>% 
  ungroup() %>% 
  select(region) %>% 
  na.omit() %>% 
  mutate(val=ifelse(region %in% c("Right-Cerebellum-Cortex", "Left-Cerebellum-Cortex"), 1, 0)) %>%
  ggseg3d(atlas=aseg_3d, label="region", text="val", colour="val", na.alpha=0.5, 
           palette=c("transparent", "deepskyblue3"), show.legend=F) %>%
  add_glassbrain() %>%
  pan_camera("left lateral") %>%
  remove_axes()

All cortical and subcortical ROIs were delineated by first co-registering the tau-PET image to a high-resolution structural T1-weighted MPRAGE acquired in the same imaging session, and then applying FreeSurfer (v5.3) for automated regional segmentation and parcellation. The following diagram from Marcoux et al. (2018) summarizes this process:

Furthermore, to mitigate issues with the lower voxel resolution of PET imaging, partial volume correction was applied, using probabilistic tissue segmentation maps to refine individual ROIs. Note: these PET processing steps were all performed by Susan Landau, Deniz Korman, and William Jagust at the Helen Wills Neuroscience Institute, UC Berkeley and Lawrence Berkeley National Laboratory.

18F-AV-1451 is a relatively recent PET tracer, and was only incorporated into the ADNI-3 pipeline beginning in 2016. I am curious about the temporal distribution of the FreeSurfer-analyzed scans here:

tau.df %>%
  # RID = unique subject identifier, EXAMDATE=date of PET scan
  select(RID, EXAMDATE) %>%
  # Create interactive plotly histogram, which auto-formats date along x-axis
  plot_ly(x=~EXAMDATE, type="histogram",
          marker = list(color = "lightsteelblue",
                        line = list(color = "lightslategray",
                                    width = 1.5))) %>%
  layout(title = 'Tau-PET Scan Date Distribution',
         xaxis = list(title = 'Scan Date',
                      zeroline = TRUE),
         yaxis = list(title = 'Number of PET Scans')) 

Even though ADNI3 officially began in 2016, most scans were acquired from mid-2017 to present day. This doesn’t affect this analysis, though, since the same tau-PET imaging protocol has been used across sites since 2016.

Data distribution

Since this project will explore tau-PET measurements over time, I will be refining the dataset to only subjects with multiple tau-PET scans. Here’s the overall distribution of number of longitudinal scans by subject:

# Plot number of PET scans by subject in a bar chart
p_num_long <- tau.df %>%
  mutate(RID=as.character(RID)) %>%
  group_by(RID) %>%
  summarise(n_scans=n()) %>%
  ggplot(., aes(x=fct_reorder(RID, n_scans, .desc=T), y=n_scans)) +
  geom_bar(stat="identity", aes(fill=n_scans, color=n_scans)) +
  labs(fill="Count", color="Count") +
  ggtitle("Number of Longitudinal PET Scans per Subject") +
  ylab("Number of PET Scans") +
  xlab("Subject") +
  theme(axis.text.x=element_blank(),
        plot.title=element_text(hjust=0.5)) 

# Convert to interactive plotly chart
ggplotly(p_num_long)
rm(p_num_long)

The majority of subjects only had one tau-PET scan; given the longitudinal nature of this project, such subjects will eventually be omitted from analysis. On that note, it’s important to know exactly how many subjects do have at least two tau-PET scans:

# Calculate number of subjects with 2+ PET scans
num_scans <- tau.df %>%
  mutate(RID=as.character(RID)) %>%
  group_by(RID) %>%
  # Tabulate # scans by subject and filter
  summarise(n_scans=n()) %>%
  filter(n_scans>=2) %>%
  ungroup() %>%
  summarise(num_subjects=n(),
            total_scans=sum(n_scans))
# Print results
cat("Number of subjects with at least two scans: **", 
    num_scans$num_subjects, "**\n", 
    "\nNumber of total PET scans: **", 
    num_scans$total_scans, "**\n", sep="")

Number of subjects with at least two scans: 249

Number of total PET scans: 593

So, we have 249 subjects with two or more scans.

Temporal distribution

Another important consideration is the length of time between each consecutive scan. I will eventually normalize changes in tau-PET to number of years passed to yield an annual rate of change, but it’s good to know what that time interval is:

# Plot the # years in between each pair of consecutive tau-PET scans
p_pet_interval <- tau.df %>%
  select(RID, EXAMDATE) %>%
  group_by(RID) %>%
  mutate(n_scans=n()) %>%
  filter(n_scans>=2) %>%
  # Calculate difference between pairs of scan dates using lag
  mutate(Years_between_Scans = 
           as.numeric((EXAMDATE - lag(EXAMDATE, 
                                      default = EXAMDATE[1]))/365)) %>%
  # Omit the first scan, for which the time interval is zero
  filter(Years_between_Scans>0) %>%
  ggplot(., aes(x=Years_between_Scans)) +
  geom_histogram(stat="count", color="lightslategray") +
  ggtitle("Years in between Tau-PET Scans per Subject") +
  ylab("Frequency") +
  xlab("# Years between two consecutive scans for a subject") +
  theme_minimal() +
  theme(plot.title=element_text(hjust=0.5)) 

# Convert to interactive plotly histogram
ggplotly(p_pet_interval)
rm(p_pet_interval)

There’s a cluster of scans around the one-year mark. Presumably, ADNI3 participants follow an annual tau-PET scanning schedule, though in some cases scans aren’t acquired at precise yearly intervals.

Data missingness

I’ll check if there are any missing data points for tau-PET SUVR values for any of the FreeSurfer-derived cortical or subcortical regions of interest (ROIs). Note: this is filtered to show only subjects with at least two scans:

# Calculate number and proportion of missing data records for each measured region of interest (ROI)
tau.df %>%
  select(-VISCODE, -VISCODE2) %>%
  group_by(RID) %>%
  mutate(n_scans=n()) %>%
  filter(n_scans>=2) %>%
  select(-n_scans) %>%
  ungroup() %>%
  # filter only to the SUVR columns
  select(!matches("VOLUME")) %>%
  # Reshape from wide --> long
  pivot_longer(cols=c(-RID, -EXAMDATE), names_to="ROI", values_to="SUVR") %>%
  mutate(ROI = str_replace(ROI, "_SUVR", "")) %>%
  group_by(ROI) %>%
  # Calculate number and proportion of missing data points for each ROI
  summarise(`Percent Missing` = sum(is.na(SUVR))/n(),
            `Number Missing` = sum(is.na(SUVR))) %>%
  datatable()

All regions have zero missing data points, so no imputation will be necessary.

SUVR Distribution by Region

Now, I’ll check the distribution of tau-PET uptake values across the ROIs.

# plot the mean tau-PET SUVR for each region of interest
p_roi_suvr <- tau.df %>%
  # omit irrelevant variables
  select(-VISCODE, -VISCODE2) %>%
  group_by(RID) %>%
  mutate(n_scans=n()) %>%
  filter(n_scans>=2) %>%
  select(-n_scans) %>%
  # only look at SUVR columns; reshape wide --> long
  select(!matches("VOLUME")) %>%
  pivot_longer(cols=c(-RID, -EXAMDATE), names_to="ROI", values_to="SUVR") %>%
  mutate(ROI = tolower(str_replace(ROI, "_SUVR", ""))) %>%
  group_by(ROI) %>%
  # calculate mean and SD, along with ymin and ymax, to plot error bars for each ROI
  summarise(Mean_SUVR=mean(SUVR, na.rm=T),
            SD_SUVR = sd(SUVR, na.rm=T),
            ymin = Mean_SUVR-SD_SUVR,
            ymax = Mean_SUVR+SD_SUVR) %>%
  # fct_reorder --> arrange ROI by its average tau-PET SUVR
  ggplot(data=., mapping=aes(x=fct_reorder(ROI, Mean_SUVR, .desc=F), 
                             y=Mean_SUVR,
                             label = ROI)) +
  geom_bar(stat="identity", show.legend=F, fill="lightsteelblue") +
  # Add error bars
  geom_errorbar(aes(ymin=ymin, ymax=ymax), width=0, color="lightslategray") +
  coord_flip() +
  theme_minimal() +
  ylab("Mean Tau-PET SUVR") +
  xlab("Region of Interest") +
  ggtitle("Mean Tau-PET SUVR by ROI") +
  theme(axis.text.x=element_text(size=8, angle=45))

# Convert to interactive plotly graph
ggplotly(p_roi_suvr, height=1000, width=600, tooltip=c("label", "y"))
rm(p_roi_suvr)

ROI Normalization

These values are supposed to be normalized to the inferior cerebellum gray matter, indicated by INFERIOR_CEREBGM_SUVR. To confirm, I’ll check the distribution of INFERIOR_CEREBGM_SUVR values.

# Plot distribution of inferior cerebellum gray tau-PET SUVR
p_inf_cb <- tau.df %>%
  select(-VISCODE, -VISCODE2) %>%
  group_by(RID) %>%
  mutate(n_scans=n()) %>%
  filter(n_scans>=2) %>%
  select(-n_scans) %>%
  # Select only SUVR columns for ROIs; pivot wide --> long
  select(!matches("VOLUME")) %>%
  pivot_longer(cols=c(-RID, -EXAMDATE), names_to="ROI", values_to="SUVR") %>%
  mutate(ROI = str_replace(ROI, "_SUVR", "")) %>%
  # Filter to inferior cerebellum gray
  filter(ROI=="INFERIOR_CEREBGM") %>%
  ggplot(data=., mapping=aes(x=SUVR)) +
  geom_histogram(aes(y=..count..), fill="lightsteelblue", color="lightslategray") +
  theme_minimal() +
  ylab("Number of Occurrences") +
  xlab("Inferior Cerebellum Gray SUVR") +
  ggtitle("Distribution of Inferior Cerebellum Gray Matter Tau Uptake") +
  theme(plot.title=element_text(hjust=0.5))

# Convert to interactive plotly histogram
ggplotly(p_inf_cb)
rm(p_inf_cb)

Most of the inferior cerebellum gray SUVR values actually fall around 1.25 rather than 1, as one might expect for the normalization reference. I’ll re-normalize all ROI values to the mean of this distribution in Data Preparation.
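One way that re-normalization could be implemented is sketched below, under the assumption that dividing by the scalar mean of the reference distribution is sufficient; the actual Data Preparation code may differ:

```r
# Divide every SUVR column by the mean inferior cerebellum gray SUVR,
# so the reference region averages to 1 across scans (illustrative sketch)
renormalize_suvr <- function(df, ref_col = "INFERIOR_CEREBGM_SUVR") {
  ref_mean  <- mean(df[[ref_col]], na.rm = TRUE)
  suvr_cols <- grep("_SUVR$", names(df), value = TRUE)
  df[suvr_cols] <- df[suvr_cols] / ref_mean
  df
}
```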

Subject demographics + cognitive assessments

Data overview

The longitudinal subject demographic and cognitive assessment dataset was downloaded as a CSV from the ADNI Study Data repository, located at Study Data/Study Info/Key ADNI tables merged into one table. This CSV file contains 14,816 rows and 113 columns. This includes many subjects with multiple follow-up visits. Columns include (anonymized) subject demographic information such as age, sex, race, and marital status. Note: ADNI data is freely accessible to all registered users. Please see my Acknowledgments page for more information about ADNI and its contributors.

This project will use one key target feature in this dataset: Clinical Dementia Rating (CDR) Sum of Boxes. The CDR scale was initially developed in 1982 at Washington University as a metric of clinical dementia progression (Hughes et al., 1982). This cognitive test measures impairment in six cognitive domains: memory, orientation, judgment and problem solving, community affairs, home and hobbies, and personal care (Morris 1991). Each of these domains is scored on a scale from 0 to 3 as follows:

  • 0 = no impairment
  • 0.5 = questionable
  • 1 = mild dementia
  • 2 = moderate dementia
  • 3 = severe dementia

The global score is most heavily influenced by the memory score, though there is an established decision tree-like algorithm for calculating the global score. By contrast, the CDR Sum of Boxes reflects the sum of the scores for each of the six domains, with an overall range of 0 (no impairment) to 18 (severe impairment). The CDR Sum of Boxes is an extension of the CDR global score, offering a more detailed quantitative measure. This metric reportedly improves precision in longitudinal cognitive tracking, particularly in cases of mild dementia (O’Bryant et al., 2008). Of note, the CDR assessment shows high inter-rater reliability, which is important given the multi-site nature of the ADNI cohort (Morris 1991).

Sources:

  • Hughes, C. P., Berg, L., Danziger, W., Coben, L. A., & Martin, R. L. (1982). A new clinical scale for the staging of dementia. The British journal of psychiatry, 140(6), 566-572.
  • Morris, J. C. (1991). The clinical dementia rating (CDR): Current version and scoring rules. Neurology, 43(11), 1588-1592.
  • O’Bryant, S. E., Waring, S. C., Cullum, C. M., Hall, J., Lacritz, L., Massman, P. J., … & Doody, R. (2008). Staging dementia using Clinical Dementia Rating Scale Sum of Boxes scores: a Texas Alzheimer’s research consortium study. Archives of neurology, 65(8), 1091-1095.
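The scoring scheme above reduces to a trivial sum, shown here with hypothetical domain scores (`cdr_sb` is my own illustrative name):

```r
# CDR Sum of Boxes: the sum of the six domain scores (0 to 18 overall)
cdr_sb <- function(memory, orientation, judgment, community, home, personal_care) {
  memory + orientation + judgment + community + home + personal_care
}

# Hypothetical subject with questionable impairment in memory and orientation:
cdr_sb(memory = 0.5, orientation = 0.5, judgment = 0,
       community = 0, home = 0, personal_care = 0)
# -> 1, on the 0 (no impairment) to 18 (severe impairment) scale
```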

Data loading

ADNI compiled a merged dataset containing key information from several tables, including subject demographics, selected cognitive assessment scores, and select biomarker data.

# Read in subject demographic dataset from ADNI
subj.info <- read.csv("../ADNI_Data/Raw_Data/ADNIMERGE.csv", stringsAsFactors = T, na.strings="")
subj.info$EXAMDATE <- as.Date(subj.info$EXAMDATE, "%m/%d/%Y")

Note: as with the tau-PET data, I’ve “simulated” this dataset using a custom encryption function (sourced locally from encrypt_df.R, not on GitHub) that modifies the columns as follows:

  • RID: adds a constant integer to the subject ID; this way, the same subject keeps the same ID across visits and datasets here, but they are untraceable to ADNI’s data
  • AGE: a randomly-sampled number from the uniform distribution is added to all data points for consistency across visits
  • EXAMDATE: adds a random number of days between -15 and 15 to each visit
  • CDRSB: values of zero are left at zero; values greater than zero have a random multiple of 0.5 ranging from -0.5 to 1 added
  • Sex: each subject is randomly assigned Male or Female based on uniform distribution
# NOT RUN; just included to show how I simulated the subject information dataset
library(tidyverse)
source("encrypt_df.R")

# subset the subject data to be simulated:
data.to.sim <- select(subj.info, c(RID, VISCODE, EXAMDATE, AGE, PTGENDER, CDRSB))

# encrypt
subj.encrypted <- encrypt_subj_info(data.to.sim)

# write to csv to be uploaded to GitHub
write.csv(subj.encrypted, "Simulated_ADNI_cognitive_scores.csv", row.names = F)
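The encryption helper itself isn’t published. As a rough sketch, a function implementing the bullet-point transformations above might look like the following; the ID offset, the age-jitter range, and all function internals are my assumptions, not the actual contents of encrypt_df.R:

```r
library(dplyr)

# Illustrative sketch of the transformations described above; the real
# encrypt_df.R is kept local, so the constants and internals here are assumptions
encrypt_subj_info <- function(df, id.offset = 1000L) {
  # One shared offset keeps each subject's AGE consistent across visits
  age.jitter <- runif(1, min = -3, max = 3)
  df %>%
    mutate(RID = RID + id.offset,
           AGE = AGE + age.jitter,
           # Shift each visit date by a random number of days in [-15, 15]
           EXAMDATE = EXAMDATE + sample(-15:15, n(), replace = TRUE),
           # Zeros stay zero; positive scores get a random multiple of 0.5 in [-0.5, 1]
           CDRSB = ifelse(CDRSB == 0, 0,
                          CDRSB + 0.5 * sample(-1:2, n(), replace = TRUE)),
           # Randomly reassign biological sex
           PTGENDER = sample(c("Male", "Female"), n(), replace = TRUE))
}
```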

The cognitive score dataset columns I will be using for this project include:

  • RID: Participant roster ID, which serves as unique subject identifier
  • VISCODE: Visit code
  • EXAMDATE: Date
  • AGE: Age at visit
  • PTGENDER: Biological sex
  • CDRSB: CDR Sum-of-Boxes score at visit
subj.info$EXAMDATE <- as.Date(subj.info$EXAMDATE, format="%m/%d/%Y")
summary(subj.info %>% select(RID, VISCODE, EXAMDATE, AGE, PTGENDER, CDRSB))
##       RID          VISCODE        EXAMDATE               AGE       
##  Min.   :   2   bl     :2272   Min.   :2005-09-07   Min.   :54.40  
##  1st Qu.: 685   m12    :1836   1st Qu.:2008-12-23   1st Qu.:69.10  
##  Median :2074   m06    :1618   Median :2012-06-19   Median :73.50  
##  Mean   :2613   m24    :1451   Mean   :2012-05-02   Mean   :73.48  
##  3rd Qu.:4513   m18    :1292   3rd Qu.:2014-03-31   3rd Qu.:78.30  
##  Max.   :6874   m36    : 857   Max.   :2020-07-27   Max.   :91.40  
##                 (Other):5490                        NA's   :4      
##    PTGENDER        CDRSB       
##  Female:6593   Min.   : 0.000  
##  Male  :8223   1st Qu.: 0.000  
##                Median : 1.000  
##                Mean   : 2.083  
##                3rd Qu.: 3.000  
##                Max.   :18.000  
##                NA's   :4172

There is a lot of missing data in this dataset. However, it includes many subjects and visit dates that don’t correspond to tau-PET scans, which therefore won’t be used in this analysis. Missingness will be re-evaluated once the PET data and subject demographic data are merged in Data Preparation.
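Per-column missingness is straightforward to tally before the merge; the following is a quick sketch on a toy data frame standing in for subj.info:

```r
# Toy stand-in for subj.info; the real data frame is loaded from ADNIMERGE.csv above
toy <- data.frame(RID   = 1:5,
                  AGE   = c(70, NA, 65, 80, NA),
                  CDRSB = c(0, 1, NA, NA, 2))

colSums(is.na(toy))                   # NA count per column
round(100 * colMeans(is.na(toy)), 1)  # percent missing per column
```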

Data distribution

The time distribution of ADNI cognitive assessment data can be visualized:

# plotly histogram of cognitive assessment dates
subj.info %>%
  select(RID, EXAMDATE) %>%
  plot_ly(x=~EXAMDATE, type="histogram",
          marker = list(color = "lightsteelblue",
                        line = list(color = "lightslategray",
                                    width = 1.5))) %>%
  layout(title = 'Subject Demographics Date Distribution',
         xaxis = list(title = 'Visit Date',
                      zeroline = TRUE),
         yaxis = list(title = 'Number of Subjects')) 

One thing to note is that tau-PET was only incorporated into the ADNI pipeline in late 2015/early 2016, so any demographic information from pre-2016 will not be included in modeling and analysis. Let’s check how many subjects had more than one visit recorded in this dataset:

# visualize distribution of # cognitive assessments per subject
p_subj_long <- subj.info %>%
  mutate(RID=as.character(RID)) %>%
  group_by(RID) %>%
  summarise(n_exams=n()) %>%
  # fct_reorder --> arrange subject on x-axis by number of exams, from large to small
  ggplot(., aes(x=fct_reorder(RID, n_exams, .desc=T), y=n_exams, label=RID)) +
  geom_bar(stat="identity", aes(fill=n_exams, color=n_exams)) +
  labs(fill="Count", color="Count") +
  ggtitle("Number of ADNI Visits per Subject") +
  ylab("Number of Visits") +
  xlab("Subjects") +
  theme(axis.text.x=element_blank(),
        plot.title=element_text(hjust=0.5)) 

# convert to interactive plotly histogram
ggplotly(p_subj_long, tooltip = c("y"))
rm(p_subj_long)

Unlike with the PET data, most subjects have two or more visits recorded with cognitive and demographic information. The subjects in this dataset have up to 21 CDR-Sum of Boxes scores. To examine precisely how many subjects have at least two visits recorded:

num.subj <- subj.info %>%
  mutate(RID=as.character(RID)) %>%
  group_by(RID) %>%
  summarise(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ungroup() %>%
  summarise(num_subjects=n(),
            total_exams=sum(n_exams))
cat("Number of subjects with at least two ADNI visits: **", 
    num.subj$num_subjects, "**\n", 
    "\nTotal number of longitudinal ADNI visits recorded: **", 
    num.subj$total_exams, "**\n", sep="")

Number of subjects with at least two ADNI visits: 2027

Total number of longitudinal ADNI visits recorded: 14571

There are 2027 subjects with two or more ADNI visits in this dataset. This should include all subjects and visit dates included in the tau-PET dataset, which will be confirmed upon merging the datasets in the Data Preparation stage.
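For that confirmation, an anti_join is one tidy way to verify coverage, since it returns rows with no match in the other table. Here is a sketch on toy visit tables; the column names mirror the real data, but the values are made up:

```r
library(dplyr)

# Toy stand-ins: tau-PET visits and cognitive-assessment visits keyed by RID + EXAMDATE
pet.visits <- data.frame(RID = c(1, 1, 2),
                         EXAMDATE = as.Date(c("2018-01-05", "2020-01-10", "2018-06-01")))
cog.visits <- data.frame(RID = c(1, 1, 2, 3),
                         EXAMDATE = as.Date(c("2018-01-05", "2020-01-10",
                                              "2018-06-01", "2019-03-02")))

# PET visits with no matching cognitive record; zero rows means full coverage
unmatched <- anti_join(pet.visits, cog.visits, by = c("RID", "EXAMDATE"))
nrow(unmatched)
# [1] 0
```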

Temporal distribution

It’s also worth checking the distribution of time interval between ADNI visits in these 2027 subjects with two or more visits:

# Plot # years between consecutive cognitive assessments by subject
p.subj.interval <- subj.info %>%
  select(RID, EXAMDATE) %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  arrange(EXAMDATE) %>%
  # Use lag to calculate time interval between exams
  mutate(Years_between_ADNI = 
           as.numeric((EXAMDATE - lag(EXAMDATE, 
                                      default = EXAMDATE[1]))/365)) %>%
  # Omit first scan per subject, for which the time interval is zero
  filter(Years_between_ADNI>0) %>%
  ggplot(., aes(x=Years_between_ADNI)) +
  geom_histogram(stat="count", fill="lightsteelblue", color="lightslategray") +
  ggtitle("Years in between ADNI visits per Subject") +
  ylab("Frequency") +
  xlab("# Years between two consecutive ADNI visits") +
  theme_minimal() +
  theme(plot.title=element_text(hjust=0.5)) 

# Convert to plotly interactive histogram
ggplotly(p.subj.interval)
rm(p.subj.interval)

Interestingly, there is a clear peak around 0.5 (six months) and a smaller peak around 1 (one year), indicating that most visits were spaced between 6 and 12 months apart. However, there is a positive skew to this distribution showing that a portion of subjects went up to five years in between visits. This is not likely to affect my analysis, as most tau-PET subjects had PET scans from 2018 onward, and would therefore have a follow-up interval of two years or less.

CDR-Sum of Boxes Scores

Now, I’ll look into the distribution of the target variable (CDRSB) and how it relates to other covariates, namely age and sex. These visualizations are filtered to show only those subjects with 2+ assessments.

# Plot CDR-SoB scores distribution across subjects with 2+ assessments
p.cdr.scores <- subj.info %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ggplot(data=., mapping=aes(x=CDRSB)) +
  geom_histogram(aes(y=..count..), fill="lightsteelblue", color="lightslategray") +
  theme_minimal() +
  ylab("# of Occurrences") +
  xlab("CDR-Sum of Boxes") +
  ggtitle("Clinical Dementia Rating (CDR) Sum of Boxes Distribution") +
  theme(plot.title=element_text(hjust=0.5))

# Convert to plotly interactive histogram
ggplotly(p.cdr.scores)
rm(p.cdr.scores)

CDR-SoB by Age + Sex

Now, to stratify by age and sex, respectively:

# Violin plot by sex for CDR-SoB
p.cdr.sex.violin <- subj.info %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ggplot(data=., mapping=aes(x=PTGENDER, y=CDRSB)) +
  geom_violin(aes(fill=PTGENDER)) +
  theme_minimal() +
  ylab("CDR Sum of Boxes Score") +
  xlab("Biological Sex") +
  geom_signif(map_signif_level = F,
              test="t.test",
              comparisons=list(c("Female", "Male"))) +
  ggtitle("Clinical Dementia Rating (CDR) Sum of Boxes by Sex") +
  theme(plot.title=element_text(hjust=0.5),
        legend.position="none")

# CDR-SoB histogram by sex
p.cdr.sex.hist <- subj.info %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ggplot(data=., mapping=aes(x=CDRSB)) +
  geom_histogram(aes(y=..count.., fill=PTGENDER)) +
  facet_wrap(PTGENDER ~ ., ncol=1) +
  theme_minimal() +
  ylab("Number of Subjects") +
  xlab("CDR Sum of Boxes Score") +
  ggtitle("Clinical Dementia Rating (CDR) Sum of Boxes Distribution by Sex") +
  theme(plot.title=element_text(hjust=0.5),
        legend.position="none")


# Convert plots to interactive plotly visualizations
p.cdr.sex.violin <- ggplotly(p.cdr.sex.violin)
p.cdr.sex.hist <- ggplotly(p.cdr.sex.hist)

# Create plotly layout using subplot, keep x and y axes distinct
plotly::subplot(p.cdr.sex.violin, p.cdr.sex.hist, shareX=F, shareY=F,
                titleX=T, titleY=T) %>% 
  # Manually supply x- and y-axis titles
  layout(xaxis = list(title = "Biological Sex", 
                      titlefont = list(size = 12)), 
         xaxis2 = list(title = "CDR Sum of Boxes Score", 
                       titlefont = list(size = 12)),
         yaxis=list(title="CDR Sum of Boxes Score", 
                    titlefont = list(size = 12)),
         yaxis2 = list(title="Number of Subjects", 
                       titlefont = list(size = 12)),
         yaxis3 = list(title="Number of Subjects", 
                       titlefont = list(size = 12)))

rm(p.cdr.sex.hist, p.cdr.sex.violin)

The distribution of CDR Sum of Boxes score looks similar between Females and Males by eye, but I’ll follow up with a t-test to confirm:

# t-test for CDR-SoB by sex
t.test.df <- subj.info %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ungroup() 
t.test(CDRSB ~ PTGENDER, data=t.test.df)
rm(t.test.df)
## 
##  Welch Two Sample t-test
## 
## data:  CDRSB by PTGENDER
## t = -2.804, df = 9542.1, p-value = 0.005057
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.27228130 -0.04822416
## sample estimates:
## mean in group Female   mean in group Male 
##             2.000650             2.160903

There is in fact a statistically significant (p<0.05) difference in CDR Sum of Boxes scores between males and females, with male subjects scoring ~0.16 points higher on average than female subjects. This is an important consideration, and I will be sure to include sex as a covariate in prediction models where applicable.
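A raw difference of ~0.16 points is easier to interpret as a standardized effect size. Below is a minimal Cohen's d helper (pooled-SD form) applied to simulated scores; the group standard deviations aren't shown in the t-test output, so the values here are illustrative, not the ADNI data:

```r
# Pooled-SD Cohen's d for two numeric vectors
cohens_d <- function(x, y) {
  nx <- length(x); ny <- length(y)
  sp <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / sp
}

# Simulated male/female CDR-SoB-like scores (illustrative means/SDs)
set.seed(5030)
male   <- rnorm(500, mean = 2.16, sd = 2.5)
female <- rnorm(500, mean = 2.00, sd = 2.5)
cohens_d(male, female)  # a |d| well below 0.2 is conventionally "small"
```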

Similarly, I’ll compare CDR Sum of Boxes with age:

# Plot CDR-SoB by age in scatter plot, only for subjects with 2+ cognitive assessments
p.age.cdr <- subj.info %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ungroup() %>%
  ggplot(data=., mapping=aes(x=AGE, y=CDRSB)) +
  labs(color="Number of Points") +
  xlab("Age at Visit") +
  ylab("CDR Sum of Boxes") +
  ggtitle("CDR Sum of Boxes vs. Age at Visit") +
  geom_point(size=1, alpha=0.2, color="lightslategray", fill="lightslategray") +
  theme_minimal() +
  theme(plot.title=element_text(hjust=0.5),
        legend.position="none")

# Plot histogram of age distribution for subjects with 2+ cognitive assessments
p.age.dist <- subj.info %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ungroup() %>%
  ggplot(data=., mapping=aes(x=AGE)) +
  xlab("Age at Visit") +
  ylab("Number of Occurrences") +
  geom_histogram(aes(y=..count..), fill="lightsteelblue", color="lightslategray") +
  theme_minimal() +
  ggtitle("Age at Visit Distribution") +
  theme(plot.title=element_text(hjust=0.5))

p.age.cdr <- ggplotly(p.age.cdr)
p.age.dist <- ggplotly(p.age.dist)
# Create plotly layout using subplot, keep x and y axes distinct
plotly::subplot(p.age.cdr, p.age.dist, shareX=F, shareY=F,
                titleX=T, titleY=T, margin = 0.05) %>% 
  # Manually supply x- and y-axis titles
  layout(xaxis = list(title = "Age at Visit", 
                      titlefont = list(size = 12)), 
         xaxis2 = list(title = "Age at Visit", 
                       titlefont = list(size = 12)),
         yaxis=list(title="CDR Sum of Boxes Score", 
                    titlefont = list(size = 12)),
         yaxis2 = list(title="Number of Subjects", 
                       titlefont = list(size = 12)),
         autosize = F, width = 800, height = 500)

rm(p.age.cdr, p.age.dist)

While there doesn’t appear to be any clear linear association between age at visit and CDR Sum of Boxes score, I’ll use cor.test to calculate the Pearson correlation coefficient and the corresponding p-value based on the correlation coefficient t-statistic:

# Correlation test between age and CDR-SoB for subjects with 2+ cognitive assessments
cor.test.df <- subj.info %>%
  group_by(RID) %>%
  mutate(n_exams=n()) %>%
  filter(n_exams>=2) %>%
  ungroup() 
cor.test(cor.test.df$AGE, cor.test.df$CDRSB)
## 
##  Pearson's product-moment correlation
## 
## data:  cor.test.df$AGE and cor.test.df$CDRSB
## t = 6.8375, df = 10395, p-value = 8.512e-12
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.04775209 0.08602444
## sample estimates:
##        cor 
## 0.06691288

Interestingly, this correlation coefficient is statistically significant (p=8.512e-12), but the effect size is very small (r=0.067). This significance seems to reflect the sheer size of the dataset rather than a strong relationship between age and CDR sum of boxes scores. Nevertheless, I will explore the use of age as a covariate in modeling later as this is a common practice.

Outlier detection

To better identify outliers based on this multivariate dataset, I will calculate Cook’s distance for each data point once the tau-PET data is merged with the cognitive status data in the Data Preparation stage.
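As a preview of that approach, here is a minimal sketch of Cook's distance-based flagging on a toy linear model; the 4/n cutoff used below is one common convention, not the only one:

```r
# Fit a simple linear model on toy data and flag high-influence points
set.seed(42)
toy <- data.frame(x = 1:30)
toy$y <- 2 * toy$x + rnorm(30)
toy$y[30] <- 200  # inject an obvious outlier

fit <- lm(y ~ x, data = toy)
cooks <- cooks.distance(fit)

# Observations exceeding the conventional 4/n influence threshold
which(cooks > 4 / nrow(toy))
```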

Phase Three: Data Preparation

Tau-PET Data

I’ll omit irrelevant columns from the tau-PET data and group it by subject:

tau.df <- tau.df %>%
  # Omit irrelevant columns
  select(-VISCODE, -HEMIWM_SUVR, -BRAAK12_SUVR,
         -BRAAK34_SUVR, -BRAAK56_SUVR, -OTHER_SUVR) %>%
  # Don't include volumetric data columns
  select(!matches("VOLUME")) %>%
  group_by(RID) 

# remove _SUVR from column names
colnames(tau.df) <- str_replace_all(colnames(tau.df), "_SUVR", "")
str(tau.df)
## tibble [1,120 x 78] (S3: grouped_df/tbl_df/tbl/data.frame)
##  $ RID                            : int [1:1120] 21 31 31 56 56 56 59 69 69 69 ...
##  $ VISCODE2                       : chr [1:1120] "m144" "m144" "m156" "m144" ...
##  $ EXAMDATE                       : Date[1:1120], format: "2018-02-02" "2018-04-24" ...
##  $ INFERIOR_CEREBGM               : num [1:1120] 1.32 1.33 1.33 1.28 1.24 ...
##  $ BRAIN_STEM                     : num [1:1120] 1.27 1.12 1.12 1.2 1.17 ...
##  $ LEFT_MIDDLEFR                  : num [1:1120] 2.02 1.93 1.8 1.83 1.78 ...
##  $ LEFT_ORBITOFR                  : num [1:1120] 2.17 2.03 1.92 2.11 1.98 ...
##  $ LEFT_PARSFR                    : num [1:1120] 2.02 2.01 1.98 2.03 1.99 ...
##  $ LEFT_ACCUMBENS_AREA            : num [1:1120] 1.14 1.04 1.79 1.12 1.18 ...
##  $ LEFT_AMYGDALA                  : num [1:1120] 1.31 1.54 1.63 1.42 1.37 ...
##  $ LEFT_CAUDATE                   : num [1:1120] 2.08 1.46 1.34 1.95 1.83 ...
##  $ LEFT_HIPPOCAMPUS               : num [1:1120] 2.12 1.96 2.2 1.69 1.73 ...
##  $ LEFT_PALLIDUM                  : num [1:1120] 3.79 1.89 1.95 2.5 2.6 ...
##  $ LEFT_PUTAMEN                   : num [1:1120] 1.69 1.64 1.42 1.9 1.78 ...
##  $ LEFT_THALAMUS_PROPER           : num [1:1120] 1.45 1.32 1.24 1.54 1.53 ...
##  $ RIGHT_MIDDLEFR                 : num [1:1120] 2.08 1.91 1.8 1.94 1.85 ...
##  $ RIGHT_ORBITOFR                 : num [1:1120] 2.19 2.01 1.86 2.17 2.03 ...
##  $ RIGHT_PARSFR                   : num [1:1120] 2.17 2.08 1.9 2.09 2.01 ...
##  $ RIGHT_ACCUMBENS_AREA           : num [1:1120] 1.41 1.65 1.66 1.01 1.07 ...
##  $ RIGHT_AMYGDALA                 : num [1:1120] 1.18 1.79 1.89 1.37 1.44 ...
##  $ RIGHT_CAUDATE                  : num [1:1120] 2.01 1.57 1.37 1.96 1.89 ...
##  $ RIGHT_HIPPOCAMPUS              : num [1:1120] 2.01 2.09 2.03 1.62 1.64 ...
##  $ RIGHT_PALLIDUM                 : num [1:1120] 3.01 2.32 2.12 2.33 2.48 ...
##  $ RIGHT_PUTAMEN                  : num [1:1120] 1.68 1.62 1.53 2.06 1.94 ...
##  $ RIGHT_THALAMUS_PROPER          : num [1:1120] 1.42 1.33 1.24 1.52 1.55 ...
##  $ CHOROID                        : num [1:1120] 7.45 4.56 4.31 3.84 3.79 ...
##  $ CTX_LH_BANKSSTS                : num [1:1120] 1.75 1.49 1.6 1.7 1.63 ...
##  $ CTX_LH_CAUDALANTERIORCINGULATE : num [1:1120] 1.67 1.73 1.65 1.69 1.69 ...
##  $ CTX_LH_CUNEUS                  : num [1:1120] 2.33 2.2 2.05 2.01 2 ...
##  $ CTX_LH_ENTORHINAL              : num [1:1120] 2.07 2.3 2.43 2.79 2.52 ...
##  $ CTX_LH_FUSIFORM                : num [1:1120] 1.97 1.87 1.83 1.84 1.77 ...
##  $ CTX_LH_INFERIORPARIETAL        : num [1:1120] 1.99 1.95 1.94 1.85 1.89 ...
##  $ CTX_LH_INFERIORTEMPORAL        : num [1:1120] 2.16 1.97 2.05 2.1 2.02 ...
##  $ CTX_LH_INSULA                  : num [1:1120] 1.51 1.64 1.65 1.51 1.48 ...
##  $ CTX_LH_ISTHMUSCINGULATE        : num [1:1120] 1.9 1.81 1.82 1.79 1.94 ...
##  $ CTX_LH_LATERALOCCIPITAL        : num [1:1120] 2.39 2.06 1.99 1.92 2 ...
##  $ CTX_LH_LINGUAL                 : num [1:1120] 2.27 1.95 1.97 1.76 1.74 ...
##  $ CTX_LH_MIDDLETEMPORAL          : num [1:1120] 2.2 2.06 1.89 2.04 1.99 ...
##  $ CTX_LH_PARACENTRAL             : num [1:1120] 1.99 1.79 1.8 1.91 1.8 ...
##  $ CTX_LH_PARAHIPPOCAMPAL         : num [1:1120] 1.6 1.86 1.92 1.72 1.66 ...
##  $ CTX_LH_PERICALCARINE           : num [1:1120] 2.23 1.45 1.41 1.56 1.54 ...
##  $ CTX_LH_POSTCENTRAL             : num [1:1120] 2.03 1.81 1.82 1.85 1.78 ...
##  $ CTX_LH_POSTERIORCINGULATE      : num [1:1120] 1.82 1.89 1.84 1.72 1.67 ...
##  $ CTX_LH_PRECENTRAL              : num [1:1120] 1.91 1.85 1.75 1.62 1.61 ...
##  $ CTX_LH_PRECUNEUS               : num [1:1120] 1.93 1.89 1.94 1.81 1.81 ...
##  $ CTX_LH_ROSTRALANTERIORCINGULATE: num [1:1120] 1.71 1.58 1.49 1.59 1.48 ...
##  $ CTX_LH_SUPERIORFRONTAL         : num [1:1120] 1.86 1.86 1.74 1.84 1.77 ...
##  $ CTX_LH_SUPERIORPARIETAL        : num [1:1120] 1.88 1.99 1.97 1.9 1.87 ...
##  $ CTX_LH_SUPERIORTEMPORAL        : num [1:1120] 2.06 1.92 1.83 1.85 1.76 ...
##  $ CTX_LH_SUPRAMARGINAL           : num [1:1120] 1.93 1.83 1.74 1.81 1.81 ...
##  $ CTX_LH_TEMPORALPOLE            : num [1:1120] 2.23 2.05 1.87 1.96 1.8 ...
##  $ CTX_LH_TRANSVERSETEMPORAL      : num [1:1120] 1.75 1.73 1.72 1.67 1.47 ...
##  $ CTX_RH_BANKSSTS                : num [1:1120] 1.78 1.85 1.68 1.68 1.65 ...
##  $ CTX_RH_CAUDALANTERIORCINGULATE : num [1:1120] 1.75 1.78 1.66 1.76 1.8 ...
##  $ CTX_RH_CUNEUS                  : num [1:1120] 2.45 2.25 2.19 2.13 2.11 ...
##  $ CTX_RH_ENTORHINAL              : num [1:1120] 1.99 3.74 3.46 2.52 2.34 ...
##  $ CTX_RH_FUSIFORM                : num [1:1120] 1.98 1.79 1.75 1.8 1.71 ...
##  $ CTX_RH_INFERIORPARIETAL        : num [1:1120] 1.96 1.93 1.82 1.81 1.82 ...
##  $ CTX_RH_INFERIORTEMPORAL        : num [1:1120] 2.25 1.97 1.81 2.01 1.98 ...
##  $ CTX_RH_INSULA                  : num [1:1120] 1.49 1.61 1.48 1.58 1.45 ...
##  $ CTX_RH_ISTHMUSCINGULATE        : num [1:1120] 2.03 1.86 1.85 1.85 1.81 ...
##  $ CTX_RH_LATERALOCCIPITAL        : num [1:1120] 2.27 1.9 1.89 1.86 1.92 ...
##  $ CTX_RH_LINGUAL                 : num [1:1120] 2.12 2.07 1.88 1.87 1.83 ...
##  $ CTX_RH_MIDDLETEMPORAL          : num [1:1120] 2.2 1.85 1.81 2 1.94 ...
##  $ CTX_RH_PARACENTRAL             : num [1:1120] 1.82 1.79 1.63 1.89 1.89 ...
##  $ CTX_RH_PARAHIPPOCAMPAL         : num [1:1120] 1.52 1.94 1.94 1.84 1.74 ...
##  $ CTX_RH_PERICALCARINE           : num [1:1120] 2.02 2.07 2 1.6 1.62 ...
##  $ CTX_RH_POSTCENTRAL             : num [1:1120] 1.91 1.9 1.74 1.81 1.76 ...
##  $ CTX_RH_POSTERIORCINGULATE      : num [1:1120] 1.79 1.79 1.68 1.68 1.77 ...
##  $ CTX_RH_PRECENTRAL              : num [1:1120] 1.72 1.87 1.77 1.71 1.71 ...
##  $ CTX_RH_PRECUNEUS               : num [1:1120] 1.86 1.82 1.75 1.85 1.86 ...
##  $ CTX_RH_ROSTRALANTERIORCINGULATE: num [1:1120] 1.78 1.63 1.57 1.52 1.49 ...
##  $ CTX_RH_SUPERIORFRONTAL         : num [1:1120] 1.86 1.9 1.81 1.82 1.79 ...
##  $ CTX_RH_SUPERIORPARIETAL        : num [1:1120] 1.78 2 1.9 1.93 1.92 ...
##  $ CTX_RH_SUPERIORTEMPORAL        : num [1:1120] 1.97 1.92 1.82 1.89 1.83 ...
##  $ CTX_RH_SUPRAMARGINAL           : num [1:1120] 1.8 1.81 1.76 1.8 1.79 ...
##  $ CTX_RH_TEMPORALPOLE            : num [1:1120] 2.29 2.3 2.21 1.98 1.75 ...
##  $ CTX_RH_TRANSVERSETEMPORAL      : num [1:1120] 1.96 2.14 1.91 1.62 1.55 ...
##  - attr(*, "groups")= tibble [776 x 2] (S3: tbl_df/tbl/data.frame)
##   ..$ RID  : int [1:776] 21 31 56 59 69 96 112 120 127 142 ...
##   ..$ .rows: list<int> [1:776] 
##   .. ..$ : int 1
##   .. ..$ : int [1:2] 2 3
##   .. ..$ : int [1:3] 4 5 6
##   .. ..$ : int 7
##   .. ..$ : int [1:3] 8 9 10
##   .. ..$ : int [1:3] 11 12 13
##   .. ..$ : int [1:3] 14 15 16
##   .. ..$ : int 17
##   .. ..$ : int 18
##   .. ..$ : int 19
##   .. ..$ : int 20
##   .. ..$ : int 21
##   .. ..$ : int 22
##   .. ..$ : int 23
##   .. ..$ : int [1:3] 24 25 26
##   .. ..$ : int 27
##   .. ..$ : int [1:2] 28 29
##   .. ..$ : int 30
##   .. ..$ : int 31
##   .. ..$ : int [1:2] 32 33
##   .. ..$ : int 34
##   .. ..$ : int 35
##   .. ..$ : int [1:2] 36 37
##   .. ..$ : int 38
##   .. ..$ : int [1:2] 39 40
##   .. ..$ : int [1:2] 41 42
##   .. ..$ : int 43
##   .. ..$ : int [1:2] 44 45
##   .. ..$ : int 46
##   .. ..$ : int 47
##   .. ..$ : int [1:3] 48 49 50
##   .. ..$ : int [1:2] 51 52
##   .. ..$ : int [1:3] 53 54 55
##   .. ..$ : int 56
##   .. ..$ : int [1:2] 57 58
##   .. ..$ : int 59
##   .. ..$ : int 60
##   .. ..$ : int 61
##   .. ..$ : int 62
##   .. ..$ : int 63
##   .. ..$ : int 64
##   .. ..$ : int 65
##   .. ..$ : int 66
##   .. ..$ : int 67
##   .. ..$ : int [1:4] 68 69 70 71
##   .. ..$ : int 72
##   .. ..$ : int [1:2] 73 74
##   .. ..$ : int [1:2] 75 76
##   .. ..$ : int 77
##   .. ..$ : int 78
##   .. ..$ : int [1:3] 79 80 81
##   .. ..$ : int [1:3] 82 83 84
##   .. ..$ : int 85
##   .. ..$ : int 86
##   .. ..$ : int 87
##   .. ..$ : int [1:2] 88 89
##   .. ..$ : int 90
##   .. ..$ : int 91
##   .. ..$ : int 92
##   .. ..$ : int [1:3] 93 94 95
##   .. ..$ : int [1:2] 96 97
##   .. ..$ : int 98
##   .. ..$ : int [1:2] 99 100
##   .. ..$ : int 101
##   .. ..$ : int 102
##   .. ..$ : int 103
##   .. ..$ : int 104
##   .. ..$ : int [1:5] 105 106 107 108 109
##   .. ..$ : int 110
##   .. ..$ : int [1:3] 111 112 113
##   .. ..$ : int [1:2] 114 115
##   .. ..$ : int 116
##   .. ..$ : int [1:4] 117 118 119 120
##   .. ..$ : int [1:2] 121 122
##   .. ..$ : int [1:4] 123 124 125 126
##   .. ..$ : int [1:3] 127 128 129
##   .. ..$ : int 130
##   .. ..$ : int 131
##   .. ..$ : int 132
##   .. ..$ : int 133
##   .. ..$ : int 134
##   .. ..$ : int [1:2] 135 136
##   .. ..$ : int 137
##   .. ..$ : int 138
##   .. ..$ : int [1:2] 139 140
##   .. ..$ : int 141
##   .. ..$ : int 142
##   .. ..$ : int 143
##   .. ..$ : int [1:3] 144 145 146
##   .. ..$ : int [1:2] 147 148
##   .. ..$ : int 149
##   .. ..$ : int [1:2] 150 151
##   .. ..$ : int 152
##   .. ..$ : int 153
##   .. ..$ : int 154
##   .. ..$ : int 155
##   .. ..$ : int 156
##   .. ..$ : int 157
##   .. ..$ : int 158
##   .. .. [list output truncated]
##   .. ..@ ptype: int(0) 
##   ..- attr(*, ".drop")= logi TRUE

SUVR normalization

As shown in Data Understanding, the ROIs are not precisely standardized to the inferior cerebellum gray matter SUVR. I will re-standardize each region’s ROI SUVR values here:

# tau.stand = tau-PET dataframe with ROI SUVR values re-standardized to inferior cerebellum gray matter SUVR
tau.stand <- tau.df
# divide each ROI column (4:ncol) by the inferior cerebellum gray matter SUVR (column 4 of tau.df)
for (i in 4:ncol(tau.stand)) {
  tau.stand[i] <- tau.stand[i]/ tau.df[4]
}
rm(tau.df)

Standardization can be verified using summary:

summary(tau.stand$INFERIOR_CEREBGM)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       1       1       1       1       1
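Equivalently, the standardization loop can be written vectorized with dplyr::across. Here is a sketch on a toy SUVR frame (the column names are illustrative); note that the reference column is saved first so later divisions don't use the already-updated value:

```r
library(dplyr)

# Toy SUVR frame: first column is the reference region, the rest are ROIs
toy <- data.frame(INFERIOR_CEREBGM  = c(1.2, 1.3),
                  LEFT_HIPPOCAMPUS  = c(2.4, 2.6),
                  RIGHT_HIPPOCAMPUS = c(1.8, 2.6))

# Save the reference first: dividing in place would set INFERIOR_CEREBGM to 1
# before the other columns are processed, corrupting their quotients
ref <- toy$INFERIOR_CEREBGM
toy.stand <- toy %>%
  mutate(across(everything(), ~ .x / ref))
```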

ROI selection, a priori

Now that regional SUVR is properly standardized, the next step is to select brain regions based on a priori knowledge of where and how tau affects the brain in MCI/AD. I am going to stratify the cortical parcellations and subcortical segmentations based on Schöll et al. (2016) and per UCSF’s recommendations for usage of their tau-PET data. Here is the stratification across the Braak stages:

# Display the UCSF & Scholl 2016 ROI Braak stage datatable
roi.braak <- read.csv("https://raw.githubusercontent.com/anniegbryant/DA5030_Final_Project/master/RData/roi_braak_stages.csv") %>% 
  mutate(ROI_Name = tolower(ROI_Name)) %>%
  mutate(Hemisphere = ifelse(str_detect(ROI_Name, "rh_|right"), "Right", "Left"))
datatable(roi.braak %>% select(-Base_Name))

The following plots show the spatial relationship of the Braak stages in the brain, in the cortical (top) and subcortical (bottom) ROIs:

# Load dataframes that link the ADNI tau-PET ROIs with nomenclature used in the ggseg package
ggseg.aparc <- read.csv("https://raw.githubusercontent.com/anniegbryant/DA5030_Final_Project/master/RData/ggseg_aparc.csv")  %>%
  mutate(Braak=as.numeric(as.roman(Braak)))
ggseg.aseg <- read.csv("https://raw.githubusercontent.com/anniegbryant/DA5030_Final_Project/master/RData/ggseg_aseg.csv") %>%
  mutate(Braak=as.numeric(as.roman(Braak)))

# Plot the Desikan-Killiany cortical parcellation atlas
p.aparc <-  dk %>% filter(hemi=="right") %>% 
  unnest(ggseg) %>% 
  select(region) %>% 
  na.omit() %>% 
  distinct() %>%
  left_join(., ggseg.aparc, by=c("region"="ggseg_ROI")) %>%
  mutate(Braak = as.character(Braak)) %>%
  # ggseg: dk=Desikan-Killiany atlas, fill by Braak region, label by ROI name
  ggseg(atlas="dk", mapping=aes(fill=Braak, label=region)) +
  scale_fill_manual(values=c("#F8766D", "#A3A617", "#00BF7D",
                             "#00B0F6", "#E76BF3"), na.value="gray80") +
  theme(axis.text.y=element_blank(),
        axis.text.x = element_text(family="calibri"),
        axis.title.x = element_text(family="calibri"),
        legend.title=element_text(family="calibri"),
        legend.text=element_text(family="calibri"))
# Convert to plotly interactive visualization
ggplotly(p.aparc, tooltip = c("fill", "label"), width=800, height=300)

# Plot the FreeSurfer subcortical segmentation atlas
p.aseg <- aseg %>% filter(hemi=="right") %>% 
  unnest(ggseg) %>% 
  select(region) %>% 
  na.omit() %>% 
  distinct() %>%
  left_join(., ggseg.aseg, by=c("region"="ggseg_ROI")) %>%
  mutate(Braak = as.character(Braak)) %>%
  # ggseg: aseg=subcortical segmentation atlas, fill by Braak region, label by ROI name
  ggseg(atlas="aseg", mapping=aes(fill=Braak, label=region)) +
  scale_fill_manual(values=c("deeppink", "#A3A617"), na.value="gray80") +
  theme(axis.text.y=element_blank(),
        axis.text.x = element_text(family="calibri"),
        axis.title.x = element_text(family="calibri"),
        legend.title=element_text(family="calibri"),
        legend.text=element_text(family="calibri"))
# Convert to plotly interactive visualization
ggplotly(p.aseg, tooltip=c("fill", "label"), width=800, height=300)

I will filter the tau-PET dataset to only include SUVR data for ROIs detailed in the above list, by first reshaping the tau-PET SUVR data from wide to long. Then, I will merge left and right hemisphere ROIs into one bilateral ROI by taking the mean SUVR.

# ADNI tau-PET data includes other ROIs outside of this Braak stage dataset; for this study, I'm not looking at those
tau.stand.roi <- tau.stand %>%
  pivot_longer(., cols=c(-RID, -VISCODE2, -EXAMDATE), names_to="ROI_Name", values_to="SUVR") %>%
  mutate(ROI_Name=tolower(ROI_Name)) %>%
  # Only keep ROIs included in Braak stage stratification via semi-join
  semi_join(., roi.braak) %>%
  # Once dataset has been filtered, add columns containing Braak stage and cortical lobe
  left_join(., roi.braak) %>%
  # Remove right/left distinction from ROI name
  mutate(ROI_Name = str_replace_all(ROI_Name, "right_|left_|ctx_rh_|ctx_lh_", "")) %>%
  # Group by bilateral ROI -- e.g. "hippocampus" which contains left and right hippocampus
  dplyr::group_by(RID, VISCODE2, EXAMDATE, ROI_Name, Braak) %>%
  # Calculate ROI average SUVR
  dplyr::summarise(SUVR = mean(SUVR, na.rm=T))

This yields the following 31 distinct bilateral ROIs:

data.frame(ROI=unique(tau.stand.roi$ROI_Name)) %>%
  left_join(., roi.braak, by=c("ROI"="Base_Name")) %>%
  select(-ROI_Name, -Hemisphere) %>%
  distinct() %>%
  datatable()

Final reshaping

Now, I will re-shape the tau-PET data back to wide to be compatible with the cognitive status data shape.

# Reshape long --> wide to be merged with CDR-SoB cognitive data
tau.stand.roi <- tau.stand.roi %>% 
  select(-Braak) %>%
  pivot_wider(id_cols=c(RID, VISCODE2, EXAMDATE), names_from="ROI_Name",
              values_from="SUVR")

str(tau.stand.roi)
## tibble [1,120 x 34] (S3: grouped_df/tbl_df/tbl/data.frame)
##  $ RID                     : int [1:1120] 21 31 31 56 56 56 59 69 69 69 ...
##  $ VISCODE2                : chr [1:1120] "m144" "m144" "m156" "m144" ...
##  $ EXAMDATE                : Date[1:1120], format: "2018-02-02" "2018-04-24" ...
##  $ amygdala                : num [1:1120] 0.943 1.254 1.327 1.09 1.127 ...
##  $ bankssts                : num [1:1120] 1.34 1.26 1.23 1.32 1.32 ...
##  $ caudalanteriorcingulate : num [1:1120] 1.29 1.32 1.25 1.35 1.4 ...
##  $ cuneus                  : num [1:1120] 1.81 1.67 1.6 1.62 1.65 ...
##  $ entorhinal              : num [1:1120] 1.54 2.28 2.22 2.07 1.95 ...
##  $ fusiform                : num [1:1120] 1.5 1.38 1.35 1.42 1.4 ...
##  $ hippocampus             : num [1:1120] 1.56 1.52 1.59 1.29 1.36 ...
##  $ inferiorparietal        : num [1:1120] 1.5 1.46 1.42 1.43 1.49 ...
##  $ inferiortemporal        : num [1:1120] 1.67 1.48 1.45 1.61 1.61 ...
##  $ insula                  : num [1:1120] 1.14 1.22 1.18 1.21 1.18 ...
##  $ isthmuscingulate        : num [1:1120] 1.49 1.38 1.38 1.42 1.51 ...
##  $ lateraloccipital        : num [1:1120] 1.77 1.49 1.46 1.47 1.58 ...
##  $ lingual                 : num [1:1120] 1.66 1.52 1.45 1.42 1.43 ...
##  $ middlefr                : num [1:1120] 1.55 1.44 1.36 1.48 1.46 ...
##  $ middletemporal          : num [1:1120] 1.67 1.47 1.39 1.58 1.58 ...
##  $ orbitofr                : num [1:1120] 1.65 1.52 1.43 1.67 1.61 ...
##  $ paracentral             : num [1:1120] 1.44 1.35 1.3 1.48 1.48 ...
##  $ parahippocampal         : num [1:1120] 1.18 1.43 1.45 1.39 1.37 ...
##  $ parsfr                  : num [1:1120] 1.59 1.54 1.46 1.61 1.61 ...
##  $ pericalcarine           : num [1:1120] 1.61 1.33 1.29 1.23 1.27 ...
##  $ postcentral             : num [1:1120] 1.49 1.4 1.34 1.43 1.42 ...
##  $ posteriorcingulate      : num [1:1120] 1.37 1.39 1.33 1.33 1.38 ...
##  $ precentral              : num [1:1120] 1.37 1.4 1.33 1.3 1.34 ...
##  $ precuneus               : num [1:1120] 1.43 1.4 1.39 1.43 1.48 ...
##  $ rostralanteriorcingulate: num [1:1120] 1.32 1.21 1.15 1.21 1.2 ...
##  $ superiorfrontal         : num [1:1120] 1.41 1.42 1.34 1.43 1.43 ...
##  $ superiorparietal        : num [1:1120] 1.39 1.5 1.46 1.49 1.52 ...
##  $ superiortemporal        : num [1:1120] 1.53 1.45 1.38 1.46 1.44 ...
##  $ supramarginal           : num [1:1120] 1.41 1.37 1.32 1.41 1.45 ...
##  $ temporalpole            : num [1:1120] 1.71 1.64 1.54 1.54 1.43 ...
##  $ transversetemporal      : num [1:1120] 1.41 1.46 1.37 1.28 1.21 ...
##  - attr(*, "groups")= tibble [1,120 x 4] (S3: tbl_df/tbl/data.frame)
##   ..$ RID     : int [1:1120] 21 31 31 56 56 56 59 69 69 69 ...
##   ..$ VISCODE2: chr [1:1120] "m144" "m144" "m156" "m144" ...
##   ..$ EXAMDATE: Date[1:1120], format: "2018-02-02" "2018-04-24" ...
##   ..$ .rows   : list<int> [1:1120] 
##   .. ..$ : int 1
##   .. ..$ : int 2
##   .. ..$ : int 3
##   .. .. [list output truncated]
##   .. ..@ ptype: int(0) 
##   ..- attr(*, ".drop")= logi TRUE

Merging datasets

subj.info <- subj.info %>% select(RID, VISCODE, AGE, PTGENDER, CDRSB)

I can’t join the two datasets on the EXAMDATE feature, as these dates sometimes differ by one or two days depending on when the records were entered. Instead, I will join by the RID subject identifier and VISCODE, a visit code identifier.

# full.df = merged tau-PET data and cognitive assessment data
full.df <- inner_join(tau.stand.roi, subj.info, by=c("RID", "VISCODE2"="VISCODE"))  %>%
  filter(!is.na(CDRSB)) %>%
  group_by(RID) %>%
  dplyr::mutate(n_visits = n()) %>%
  # Only keep subjects with 2+ PET scans and cognitive assessments
  filter(n_visits>1) %>%
  select(-n_visits)

Here’s the structure of this merged dataset:

str(full.df)
## tibble [576 x 37] (S3: grouped_df/tbl_df/tbl/data.frame)
##  $ RID                     : int [1:576] 31 31 56 56 56 69 69 69 96 96 ...
##  $ VISCODE2                : chr [1:576] "m144" "m156" "m144" "m156" ...
##  $ EXAMDATE                : Date[1:576], format: "2018-04-24" "2019-04-23" ...
##  $ amygdala                : num [1:576] 1.25 1.33 1.09 1.13 1.05 ...
##  $ bankssts                : num [1:576] 1.26 1.23 1.32 1.32 1.3 ...
##  $ caudalanteriorcingulate : num [1:576] 1.32 1.25 1.35 1.4 1.21 ...
##  $ cuneus                  : num [1:576] 1.67 1.6 1.62 1.65 1.56 ...
##  $ entorhinal              : num [1:576] 2.28 2.22 2.07 1.95 1.98 ...
##  $ fusiform                : num [1:576] 1.38 1.35 1.42 1.4 1.34 ...
##  $ hippocampus             : num [1:576] 1.52 1.59 1.29 1.36 1.34 ...
##  $ inferiorparietal        : num [1:576] 1.46 1.42 1.43 1.49 1.45 ...
##  $ inferiortemporal        : num [1:576] 1.48 1.45 1.61 1.61 1.58 ...
##  $ insula                  : num [1:576] 1.22 1.18 1.21 1.18 1.08 ...
##  $ isthmuscingulate        : num [1:576] 1.38 1.38 1.42 1.51 1.36 ...
##  $ lateraloccipital        : num [1:576] 1.49 1.46 1.47 1.58 1.59 ...
##  $ lingual                 : num [1:576] 1.52 1.45 1.42 1.43 1.38 ...
##  $ middlefr                : num [1:576] 1.44 1.36 1.48 1.46 1.42 ...
##  $ middletemporal          : num [1:576] 1.47 1.39 1.58 1.58 1.56 ...
##  $ orbitofr                : num [1:576] 1.52 1.43 1.67 1.61 1.5 ...
##  $ paracentral             : num [1:576] 1.35 1.3 1.48 1.48 1.41 ...
##  $ parahippocampal         : num [1:576] 1.43 1.45 1.39 1.37 1.29 ...
##  $ parsfr                  : num [1:576] 1.54 1.46 1.61 1.61 1.5 ...
##  $ pericalcarine           : num [1:576] 1.33 1.29 1.23 1.27 1.22 ...
##  $ postcentral             : num [1:576] 1.4 1.34 1.43 1.42 1.34 ...
##  $ posteriorcingulate      : num [1:576] 1.39 1.33 1.33 1.38 1.25 ...
##  $ precentral              : num [1:576] 1.4 1.33 1.3 1.34 1.25 ...
##  $ precuneus               : num [1:576] 1.4 1.39 1.43 1.48 1.37 ...
##  $ rostralanteriorcingulate: num [1:576] 1.21 1.15 1.21 1.2 1.06 ...
##  $ superiorfrontal         : num [1:576] 1.42 1.34 1.43 1.43 1.35 ...
##  $ superiorparietal        : num [1:576] 1.5 1.46 1.49 1.52 1.43 ...
##  $ superiortemporal        : num [1:576] 1.45 1.38 1.46 1.44 1.37 ...
##  $ supramarginal           : num [1:576] 1.37 1.32 1.41 1.45 1.37 ...
##  $ temporalpole            : num [1:576] 1.64 1.54 1.54 1.43 1.47 ...
##  $ transversetemporal      : num [1:576] 1.46 1.37 1.28 1.21 1.06 ...
##  $ AGE                     : num [1:576] 77.7 77.7 69.6 69.6 69.6 72.9 72.9 72.9 79.6 79.6 ...
##  $ PTGENDER                : Factor w/ 2 levels "Female","Male": 1 1 1 1 1 2 2 2 2 2 ...
##  $ CDRSB                   : num [1:576] 0 0 0.5 0.5 0 0.5 0 0.5 0 0 ...
##  - attr(*, "groups")= tibble [243 x 2] (S3: tbl_df/tbl/data.frame)
##   ..$ RID  : int [1:243] 31 56 69 96 112 377 416 467 618 668 ...
##   ..$ .rows: list<int> [1:243] 
##   .. ..$ : int [1:2] 1 2
##   .. ..$ : int [1:3] 3 4 5
##   .. ..$ : int [1:3] 6 7 8
##   .. .. [list output truncated]
##   .. ..@ ptype: int(0) 
##   ..- attr(*, ".drop")= logi TRUE
cat("\nNumber of longitudinal tau-PET scans with accompanying cognitive data: **\n",
    nrow(full.df), "**\nNumber of subjects in merged dataset: **", 
    length(unique(full.df$RID)), "**\n", "\n", sep="")

Number of longitudinal tau-PET scans with accompanying cognitive data: 576
Number of subjects in merged dataset: 243

As it turns out, only 576 of the original 593 tau-PET scans survive the merge, which requires a matching cognitive assessment and at least two visits per subject. This leaves 576 unique PET scan data points across 243 subjects.
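The scans dropped by the inner join can be identified with dplyr’s anti_join. Below is a minimal sketch on toy stand-ins for tau.stand.roi and subj.info (the values are made up for illustration; with the real data, the same call would take the actual tables):

```r
library(dplyr)

# Toy stand-ins for the scan and assessment tables (hypothetical values)
scans <- tibble(RID = c(1, 1, 2, 3), VISCODE2 = c("m12", "m24", "m12", "m12"))
assessments <- tibble(RID = c(1, 1, 2),
                      VISCODE = c("m12", "m24", "m12"),
                      CDRSB = c(0, 0.5, 1))

# anti_join keeps the scans with no matching cognitive assessment
dropped <- anti_join(scans, assessments, by = c("RID", "VISCODE2" = "VISCODE"))
dropped  # RID 3's scan has no assessment and would be excluded by the inner join
```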

Rate of change calculation

Lastly, before I can perform outlier detection, I need to derive the longitudinal features upon which the prediction models will be built – namely, annual change in tau-PET SUVR and annual change in CDR-Sum of Boxes score.

# Calculate change in tau-PET SUVR and CDR-SoB over time, normalized to # of years elapsed
annual.changes <- full.df %>%
  ungroup() %>%
  select(-VISCODE2) %>%
  # Reshape wide --> long
  pivot_longer(cols=c(-RID, -EXAMDATE, -PTGENDER, -AGE), names_to="Metric",
               values_to="Value") %>%
  # Calculate number of years between each visit as well as change in SUVR or CDR-SoB
  dplyr::group_by(RID, PTGENDER, Metric) %>%
  dplyr::summarise(n_years = as.numeric((EXAMDATE - lag(EXAMDATE, 
                                                        default=EXAMDATE[1]))/365),
                   change = Value - lag(Value, default=Value[1]),
                   Age_Baseline = AGE[n_years==0]) %>%
  # Remove data points corresponding to first visit
  filter(n_years > 0) %>%
  # Calculate change in either tau-PET SUVR or CDR-SoB change per year
  dplyr::mutate(Annual_Change = change/n_years) %>%
  # Remove columns that are no longer needed
  select(-n_years, -change) %>%
  group_by(RID, Metric) %>%
  # Assign row identifier with row_number()
  dplyr::mutate(interval_num = row_number()) %>%
  # Reshape from long --> wide
  pivot_wider(., id_cols=c(RID, Age_Baseline, PTGENDER, interval_num), names_from=Metric,
              values_from=Annual_Change) %>%
  rename("Sex" = "PTGENDER")
datatable(annual.changes[1:5])

Outlier detection

Now that the datasets are merged, I can perform outlier detection. Given the multivariate nature of this dataset (i.e. multiple brain regions), I will use Cook’s Distance to estimate the relative influence of each data point in a multiple regression model.

# Calculate Cook's Distance using multiple regression of CDR-SoB annual change on tau-PET SUVR change for all 31 ROIs
cooks.distance <- cooks.distance(lm(CDRSB ~ . - RID - interval_num, data=annual.changes))

# Plot Cook's distance for each subject
p.cooks <- data.frame(CD=cooks.distance) %>%
  rownames_to_column(var="Data_Point") %>%
  mutate(Data_Point=as.numeric(Data_Point)) %>%
  mutate(Label=ifelse(CD>30, Data_Point, NA_real_)) %>%
  ggplot(data=., mapping=aes(x=Data_Point,y=CD)) +
  geom_hline(yintercept = 4*mean(cooks.distance,na.rm=T), color="blue") +
  geom_point() +
  geom_text(aes(label=Label), nudge_y=1.5) +
  ylab("Cook's Distance") +
  xlab("annual.changes index") +
  theme_minimal()

# Convert to interactive plotly visualization
ggplotly(p.cooks)
rm(p.cooks)

All but one data point have relatively low Cook’s distance values, while data point #224 has a relatively large Cook’s distance. This indicates a combination of large residual and high leverage, which could distort model fitting and accuracy. Upon further examination of this instance:

# Show data from row 224
as.data.frame(t(annual.changes[224,])) %>%
  rownames_to_column(var="Variable") %>%
  dplyr::rename("Value" = "V1") %>%
  datatable()

This subject exhibits very large fluctuations in tau-PET SUVR values in several brain regions during this time interval. Given that SUVR values typically range from 0.75 to 2, changes of this magnitude are surprising and likely render this data point an outlier. Fortunately, the interval_num of 2 indicates that this is the second time interval for this subject, so omitting this interval doesn’t reduce the total number of subjects in the analysis. I will remove this data point:

# Omit row 224
annual.changes <- annual.changes[-224,]

Tau-PET annual change across ROIs

I can now finish some aspects of data exploration that depended upon refining the subject cohort as well as the features. For starters, I will examine the distribution of annual tau change in each of the 31 ROIs:

# reshape tau-PET SUVR change ROI-wise data from wide --> long
annual.changes.tau <- annual.changes %>%
  select(-CDRSB) %>%
  ungroup() %>%
  pivot_longer(cols=c(-RID, -interval_num, -Age_Baseline, -Sex), names_to="ROI",
               values_to="deltaSUVR")

# Plot SUVR change distribution for each ROI using multi.hist from psych package
multi.hist(annual.changes %>% ungroup() %>% select(-RID, -interval_num, -CDRSB, -Age_Baseline, -Sex),
           dcol="red")

The distribution looks reasonably normal for each ROI, and all of the curves peak around zero, suggesting each ROI has a mean annual change near 0. Since these data contain negative values, neither a log nor a square-root transformation is applicable anyway. Therefore, I will leave the variable distributions as-is.
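If a formal normality check were desired to back up the visual inspection, a Shapiro-Wilk test could be applied per column. A sketch on simulated data (the real check would loop over the annual-change columns of annual.changes):

```r
# Shapiro-Wilk normality test applied column-wise to simulated annual-change data
set.seed(1)
toy <- data.frame(roiA = rnorm(100, mean = 0, sd = 0.05),
                  roiB = rnorm(100, mean = 0, sd = 0.08))

# One p-value per column; p > 0.05 is consistent with (but does not prove) normality
p_vals <- sapply(toy, function(x) shapiro.test(x)$p.value)
p_vals
```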

Next, I will visualize the correlation in annual tau change between each of the ROIs measured:

# Select only tau-PET ROIs from annual change dataset
annual.roi <- annual.changes %>% ungroup() %>% select(-RID, -interval_num, -CDRSB, -Age_Baseline, -Sex)
# Calculate correlation matrix between all ROIs
roi.cor <- cor(annual.roi)

# Plot the correlation matrix using ggcorrplot, order by hierarchical clustering
ggcorrplot(roi.cor, hc.order = TRUE, outline.col = "white") %>% 
  # Convert to interactive correlogram with plotly
  ggplotly(width=800, height=700)

As it turns out, all ROIs show positive correlations in the annual rate of change in tau-PET uptake, with the exception of three ROI pairs:

  • entorhinal and pericalcarine (R=-0.14)
  • entorhinal and transverse temporal (R=-0.07)
  • transverse temporal and lateral occipital (R=-0.07)
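
The negatively correlated pairs can also be pulled out programmatically from the correlation matrix. A sketch on a small toy matrix (with the real data, substitute roi.cor for m):

```r
# Toy 3x3 correlation matrix with one negative pair (values are illustrative)
rois <- c("entorhinal", "pericalcarine", "fusiform")
m <- matrix(c( 1.00, -0.14, 0.30,
              -0.14,  1.00, 0.50,
               0.30,  0.50, 1.00),
            nrow = 3, dimnames = list(rois, rois))

# Index the upper triangle only, so each pair is reported once
neg <- which(m < 0 & upper.tri(m), arr.ind = TRUE)
negative_pairs <- data.frame(ROI1 = rownames(m)[neg[, "row"]],
                             ROI2 = colnames(m)[neg[, "col"]],
                             r = m[neg])
negative_pairs
```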

These are very weak correlations, and can be further visualized with scatter plots:

# Highlight the ROIs noted above with (slightly) negative correlations
# Use ggpairs function from GGally to visualize their pairwise distributions
p.select.rois <- ggpairs(annual.roi %>% select(entorhinal, pericalcarine, 
                                               transversetemporal,
                                               lateraloccipital),
                         # Add regression line to scatter plots
                         lower = list(continuous = wrap("smooth", se=F, 
                                                        method = "lm", 
                                                        color="lightslategray",
                                                        alpha=0.4))) +
  theme_minimal()

# Convert to interactive pair plot with plotly
ggplotly(p.select.rois, width=700, height=600)

Based on the scatter plots, these negative correlations are indeed weak and likely reflect noise. Regarding the positive correlations, I am curious whether there are underlying trends based on spatial proximity and/or tau progression in AD, as captured by cortical lobe and Braak stage, respectively:

# Convert correlation matrix from wide --> long, each row will detail one pairwise correlation
roi.cor.long <-  as.data.frame(roi.cor) %>%
  rownames_to_column(var="ROI1") %>%
  pivot_longer(cols=c(-ROI1), names_to="ROI2", values_to="Pearson_Corr") %>%
  # Exclude rows where the two ROIs are the same, where correlation is always 1
  filter(ROI1 != ROI2) %>%
  # Join with Braak-stage stratification dataframe by the first ROI column
  left_join(., roi.braak, by=c("ROI1"="Base_Name")) %>%
  select(-ROI_Name, -Hemisphere) %>%
  # Specify that Braak + Cortex info pertain to the first ROI column, not the second
  dplyr::rename("ROI1_Braak" = "Braak", "ROI1_Cortex" = "Cortex") %>%
  # Merge again with Braak-stage stratification, this time by the second ROI column
  left_join(., roi.braak, by=c("ROI2" = "Base_Name")) %>%
  select(-ROI_Name, -Hemisphere) %>%
  dplyr::rename("ROI2_Braak" = "Braak", "ROI2_Cortex" = "Cortex") %>%
  # Rename Insula as Ins for visualization purpose
  mutate_at(c("ROI1_Cortex", "ROI2_Cortex"), function(x) ifelse(x=="Insula", "Ins", x))

# Plot the correlation by Cortical Lobe
p.cor.cortical <- roi.cor.long %>%
  ggplot(., mapping=aes(x=ROI1, y=ROI2)) +
  # Heatmap, fill squares by pearson correlation coefficient
  geom_tile(mapping=aes(fill=Pearson_Corr)) +
  labs(fill="Pearson Coefficient") +
  theme_minimal() +
  # Facet by cortical lobes
  facet_grid(ROI2_Cortex ~ ROI1_Cortex, scales="free", 
             space="free", switch="both") +
  ggtitle("Correlation in Annual Tau SUVR Change by Cortical Lobe") +
  theme(axis.title=element_blank(),
        axis.text.y = element_text(size=11),
        axis.text.x = element_text(angle=90, size=11, hjust=1),
        panel.border = element_blank(), 
        panel.grid = element_blank(),
        plot.title=element_text(hjust=0.5)) +
  theme(strip.placement = "outside") +
  scale_fill_gradient2(low="#210DFC", mid="white", high="#FF171B",
                       limits=c(-1,1))

# Save correlation matrix to PNG and print -- dimensions were bizarre if I plotted directly from ggplot
ggsave("ROI_Correlation_Cortical.png", plot=p.cor.cortical, width=9, height=7.5, units="in", dpi=300)
include_graphics("ROI_Correlation_Cortical.png")

There are somewhat stronger inter-correlations within the frontal and parietal cortices compared with other cortical lobes. Now stratifying based on ROI Braak stage:

# Plot the correlation by Braak Stage
p.cor.braak <- roi.cor.long %>%
  ggplot(., mapping=aes(x=ROI1, y=ROI2)) +
  # Heatmap, fill squares by pearson correlation coefficient
  geom_tile(mapping=aes(fill=Pearson_Corr)) +
  labs(fill="Pearson Coefficient") +
  theme_minimal() +
  # Facet by Braak stage
  facet_grid(ROI2_Braak ~ ROI1_Braak, scales="free", 
             space="free", switch="both") +
  ggtitle("Correlation in Annual Tau SUVR Change by Braak Stage") +
  theme(axis.title=element_blank(),
        axis.text.y = element_text(size=11),
        axis.text.x = element_text(angle=90, size=11, hjust=1),
        panel.border = element_blank(), 
        panel.grid = element_blank()) +
  theme(strip.placement = "outside") +
  scale_fill_gradient2(low="#210DFC", mid="white", high="#FF171B",
                       limits=c(-1,1))

# Save correlation matrix to PNG and print -- dimensions were bizarre if I plotted directly from ggplot
ggsave("ROI_Correlation_Braak.png", plot=p.cor.braak, width=9, height=7.5, units="in", dpi=300)
include_graphics("ROI_Correlation_Braak.png")

 

These generally high correlations in annual tau-PET SUVR change between cortical regions may pose a challenge for modeling due to feature collinearity.
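One way to quantify this collinearity is the variance inflation factor (VIF), computed here by hand as 1/(1 - R²) from regressing each predictor on the rest. This is a sketch on simulated data; with the real data, the same loop could run over the columns of annual.roi:

```r
# Simulate three predictors: x2 is strongly collinear with x1, x3 is independent
set.seed(2)
x1 <- rnorm(100)
x2 <- x1 + rnorm(100, sd = 0.2)
x3 <- rnorm(100)
df <- data.frame(x1, x2, x3)

# VIF for each predictor: regress it on the others and take 1 / (1 - R^2)
vif <- sapply(names(df), function(v) {
  r2 <- summary(lm(reformulate(setdiff(names(df), v), v), data = df))$r.squared
  1 / (1 - r2)
})
round(vif, 1)  # x1 and x2 show inflated VIFs; x3 stays near 1
```

VIF values above roughly 5-10 are a common rule of thumb for problematic collinearity, which motivates the dimensionality reduction in the next section.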

Principal component analysis (PCA)

While I want to keep each region distinct for biological context, I will also reduce the dimensionality of the data using principal component analysis (PCA), which has the added benefit of yielding orthogonal (uncorrelated) components to serve as input for the modeling phase. Since all the variables are in the same units (i.e. change in SUVR per year), I only need to center the data, not scale it.

# Convert dataframe to numerical matrix
pca.df <- as.matrix(annual.changes %>% ungroup() %>% select(-RID, -CDRSB, -interval_num, -Age_Baseline, -Sex))

# Perform PCA with prcomp() from stats package
# Center data but don't scale, as all columns are in same units (change in SUVR per year)
res.pca <- prcomp(pca.df, center=T, scale.=F)

# The variable info can be extracted as follows:
var <- get_pca_var(res.pca)

The proportion of variance explained by each principal component (PC) can be visualized using a Scree plot:

# Calculate cumulative proportion of variance explained
cumpro <- cumsum(res.pca$sdev^2 / sum(res.pca$sdev^2)*100)
# Calculate individual proportion of variance explained by each principal component
variances <- data.frame((res.pca$sdev^2/sum(res.pca$sdev^2))*100)
# Label PCs
variances$PC <- c(1:31)
# Add in cumulative proportion of variance to dataframe
variances$cumpro <- cumpro
# Establish column names
colnames(variances) <- c("Variance_Proportion", "PC", "CumVar")

# Create a new row just for visualization purpose, helps with axis structure
newrow <- subset(variances, PC == 31)
newrow$PC <- 31.5
variances <- plyr::rbind.fill(variances, newrow)

# Set individual variance line as maroon, cumulative variance line as green
linecolors <- c("Component Variance" = "maroon4",
                "Cumulative Variance" = "green4")

# Plot the customized Scree plot
p.var <- variances %>%
  ggplot(data=.) +
  # Add a bar for each principal component, showing individual proportion of variance
  geom_bar(data=subset(variances, PC < 32), mapping=aes(x=PC, y=Variance_Proportion),
           stat="identity", fill="steelblue") +
  # Add line for individual component variance
  geom_line(aes(color="Component Variance", x=PC, y=Variance_Proportion), 
            size=0.7, data=subset(variances, PC < 31.1), show.legend=F) +
  # Add line for cumulative variance explained with each additional principal component
  geom_line(aes(x=PC, y=CumVar, color="Cumulative Variance"), size=0.7,
            data=subset(variances, PC < 32)) +
  geom_point(aes(x=PC, y=Variance_Proportion),data=subset(variances, PC < 31.1), size=1.5) +
  # Set line colors as defined above
  scale_colour_manual(name="",values=linecolors, 
                      guide = guide_legend(override.aes=list(size=2))) + 
  theme_minimal() +
  ylab("Percentage of Variance Explained") +
  xlab("Principal Component") +
  ggtitle("Principal Components\nContribution to Subject Variance") +
  # Manually define x-axis and y-axis limits so line aligns with bar plot
  xlim(c(0.5, 31.5)) +
  scale_y_continuous(breaks=seq(0, 100, 10),
                     sec.axis = dup_axis(name="")) +
  theme(plot.title = element_text(hjust=0.5, size=14),
        axis.text = element_text(size=12),
        panel.grid = element_blank(),
        legend.position="bottom",
        legend.text = element_text(size=12))

# Convert to plotly interactive visualization
ggplotly(p.var, tooltip=c("x","y")) %>% 
  layout(legend = list(orientation = "h", y=-0.2))

The first five principal components (PCs) collectively explain 77.2% of variance in the data; beyond these components, there are only marginal increases in the cumulative variance explained. Therefore, I will move forward with these first five PCs.
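The cutoff could equivalently be chosen programmatically as the smallest number of components whose cumulative variance crosses a chosen threshold. A sketch on toy data (with the real data, cumpro/100 would play the role of cum_var):

```r
# Toy data with correlated columns, so the first components dominate
set.seed(3)
toy <- matrix(rnorm(200), ncol = 4) %*% matrix(rnorm(16), ncol = 4)

# Center but don't scale, mirroring the analysis above
pca <- prcomp(toy, center = TRUE, scale. = FALSE)

# Cumulative proportion of variance, then the smallest k crossing 75%
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k <- which(cum_var >= 0.75)[1]
k
```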

Individual ROI contributions (loadings) per component can be extracted:

# variable loadings are stored in the $rotation vector of prcomp output
loadings_wide <- data.frame(res.pca$rotation) %>%
  # add rownames as column
  cbind(ROI=rownames(.), .) %>%
  # remove original row names
  remove_rownames() %>% 
  # Select ROI and first five PCs
  select(ROI:PC5) %>%
  rowwise() %>%
  # Join with Braak stage stratification dataframe
  left_join(., roi.braak, by=c("ROI"="Base_Name")) %>%
  select(ROI, Cortex, Braak, PC1:PC5) %>%
  distinct()

# Print interactive datatable
datatable(loadings_wide %>% mutate_if(is.numeric, function(x) round(x,4)))

I’m curious as to whether ROIs exhibit similar covariance in annual tau-PET changes based on spatial proximity (i.e. cortical region) and/or similar Alzheimer’s Disease progression (i.e. Braak stage).

# Plot loadings colored by cortical lobe
p.cortex <- loadings_wide %>%
  ggplot(data=., mapping=aes(x=PC1, y=PC2, label=ROI)) +
  geom_hline(yintercept=0, linetype=2, alpha=0.5) +
  geom_vline(xintercept=0, linetype=2, alpha=0.5) +
  geom_point(aes(color=Cortex), size=3) +
  theme_minimal() +
  xlab("PC1 (41.6% Variance)") +
  ylab("PC2 (14.0% Variance)") +
  ggtitle("ROI PC Loadings by Cortical Region") +
  theme(plot.title=element_text(hjust=0.5))

# Convert to interactive scatterplot with plotly
ggplotly(p.cortex, tooltip=c("label", "x", "y"))
rm(p.cortex)

The first observation is that all of the ROIs exhibit a negative loading on PC1. Beyond that, all of the occipital and parietal cortex ROIs load positively on PC2, while the insula, temporal cortex, and cingulate cortex ROIs all load negatively on PC2. The frontal cortex ROIs straddle the PC2 axis, with low loadings in both directions.
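Note that PC loading signs are arbitrary up to a global flip of each component, so the uniformly negative PC1 loadings reflect this particular prcomp run rather than a biological direction; what matters is that they share a sign, consistent with a common underlying factor. A sketch of that same-sign check on toy data with a shared latent factor:

```r
# Three variables driven by one shared factor plus noise
set.seed(4)
shared <- rnorm(50)
toy <- data.frame(a = shared + rnorm(50, sd = 0.3),
                  b = shared + rnorm(50, sd = 0.3),
                  c = shared + rnorm(50, sd = 0.3))

pca <- prcomp(toy, center = TRUE)
pc1 <- pca$rotation[, "PC1"]

# With a dominant shared factor, all PC1 loadings carry the same sign
same_sign <- all(pc1 > 0) || all(pc1 < 0)
same_sign
```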

# Plot PC loadings colored by Braak stage
p.braak <- loadings_wide %>%
  ggplot(data=., mapping=aes(x=PC1, y=PC2, label=ROI)) +
  geom_hline(yintercept=0, linetype=2, alpha=0.5) +
  geom_vline(xintercept=0, linetype=2, alpha=0.5) +
  geom_point(aes(color=Braak), size=3) +
  theme_minimal() +
  xlab("PC1 (41.6% Variance)") +
  ylab("PC2 (14.0% Variance)") +
  ggtitle("ROI PC Loadings by Braak Stage") +
  theme(plot.title=element_text(hjust=0.5))

# Convert to interactive scatterplot with plotly
ggplotly(p.braak, tooltip=c("label", "x", "y"))
rm(p.braak)

The distinction is less clear when stratifying by Braak stage. One observation that does stand out is that all of the Braak VI ROIs cluster together in the upper-right region of the plot. Beyond that, the Braak stages are intermixed in this loading plot.

Moving on, the subject and time interval info can be linked with the PCA results:

# post.pca = subject identification info + first five PCs
post.pca <- as.data.frame(res.pca$x[,1:5]) %>%
  cbind(., RID=annual.changes$RID) %>%
  cbind(., interval_num=annual.changes$interval_num) %>%
  cbind(., CDRSB=annual.changes$CDRSB) %>%
  cbind(., Age_Baseline=annual.changes$Age_Baseline) %>%
  cbind(., Sex=annual.changes$Sex) %>%
  select(RID, interval_num, Age_Baseline, Sex, CDRSB, PC1:PC5)

# print interactive datatable
datatable(post.pca %>% mutate_if(is.numeric, function(x) round(x,5)) %>% select(-Age_Baseline, -Sex))

Dummy variable encoding

The final step is to convert sex into a dummy variable for modeling:

# Encode sex as a binary variable for Sex_Male
annual.changes$Sex_Male <- ifelse(annual.changes$Sex=="Male", 1, 0)
post.pca$Sex_Male <- ifelse(post.pca$Sex=="Male", 1, 0)

# Remove original Sex feature
annual.changes <- annual.changes %>% select(-Sex)
post.pca <- post.pca %>% select(-Sex)

Phase Four: Modeling (Original Data)

Data split for train + test

As per standard convention in model development, I will randomly partition this dataset into training and testing subsets, so that the models are trained and evaluated on independent data. I will use the sample function with a fixed seed (127) to partition the data into 75% training and 25% testing.

# Set seed for reproducibility of the random train/test split
set.seed(127)
train.index <- sample(nrow(annual.changes), nrow(annual.changes)*0.75, replace=F)

# Remove unnecessary identifier info from datasets for modeling
original <- annual.changes %>% ungroup() %>% select(-RID, -interval_num)

# Pre-processing will be applied in model training with caret

# Subset training + test data for original (ROI) data
original.train <- original[train.index, ]
original.test <- original[-train.index, ]

Individual model training and tuning

I will be using the caretEnsemble package to compile four individual regression models into a stacked ensemble. This package enables evaluation of the models individually as well as together in the ensemble.

The first step is to create a caretList of the four regression models I will use:

  • Elastic net regression (glmnet)
  • k-nearest neighbors regression (knn)
  • Neural network regression (nnet)
  • Random forest regression (ranger)

I will use the trainControl function to specify ten-fold cross-validation with parallel processing.

# trainControl for 10-fold CV and parallel processing
ensemble.control <- trainControl(method="cv", number=10, allowParallel=T)

I will use RMSE as the metric for tuning model parameters. All input data will be automatically standardized using caret’s preProcess center + scale option; the centering and scaling parameters are estimated from the training data and then applied to any data passed in at prediction time.

# Set seed for consistency
set.seed(127)

# Neural net -- linout=T means linear output (i.e. not constrained to be [0,1]),
# try a range of hidden layer sizes and weight decays
my.neuralnet <- caretModelSpec(method="nnet", linout=T, trace=F, 
                               tuneGrid = expand.grid(size=c(1, 3, 5, 10, 15), 
                                                      decay = seq(0, 1, by=0.2)))

# k-nearest neighbors: try 20 different values of k
my.knn <- caretModelSpec(method="knn", tuneLength=20)

# random forest: try 15 different numbers of features considered at each node and use 500 sampled trees
my.randomforest <- caretModelSpec(method="ranger", tuneLength=15, num.trees=500, importance="permutation")

# elastic net: try four different values of alpha for ridge/lasso blending and four lambda values for coefficient penalty
my.elasticnet <- caretModelSpec(method="glmnet",
                                tuneGrid=expand.grid(alpha=c(0,0.1,0.6,1), lambda=c(5^-5,5^-3,5^-1,1)))

# Compile individual models into one cohesive model list using caretList
invisible(capture.output(ensemble.models <- caretList(CDRSB ~ ., 
                                                      data=original.train, 
                                                      trControl=ensemble.control, 
                                                      metric="RMSE", 
                                                      preProcess=c("center", "scale"),
                                                      tuneList=list(my.neuralnet, 
                                                                    my.knn, 
                                                                    my.randomforest, 
                                                                    my.elasticnet))))

The final chosen parameters for each model can be viewed:

Elastic net final model

# Print elastic net model summary
ensemble.models$glmnet

# Elastic net cross-validation results
glmnet.alpha <- ensemble.models$glmnet$results$alpha
glmnet.rmse <- ensemble.models$glmnet$results$RMSE
glmnet.mae <- ensemble.models$glmnet$results$MAE
glmnet.lambda <- ensemble.models$glmnet$results$lambda

# Plot the RMSE and MAE in a facet plot using facet_wrap
p.glmnet.cv <- data.frame(alpha=glmnet.alpha, RMSE=glmnet.rmse, MAE=glmnet.mae, lambda=glmnet.lambda) %>%
  mutate(alpha=as.character(alpha)) %>%
  pivot_longer(cols=c(RMSE, MAE), names_to="Metric", values_to="Value") %>%
  mutate(Metric = ifelse(Metric=="MAE", "Mean Absolute Error", "Root-Mean Square Error")) %>%
  # One line per value of alpha
  ggplot(data=., mapping=aes(x=lambda, y=Value, color=alpha)) +
  # Facet the MAE and RMSE in separate plots
  facet_wrap(Metric ~ ., scales="free") +
  geom_point() +
  geom_line(aes(group=alpha)) +
  theme_minimal() +
  ggtitle("Elastic Net Regression Cross-Validated Results") +
  theme(plot.title=element_text(hjust=0.5),
        axis.title=element_blank(),
        panel.spacing = unit(2, "lines"))

# Convert to interactive plotly
ggplotly(p.glmnet.cv) %>% 
  layout(yaxis = list(title = "MAE", 
                      titlefont = list(size = 12)),
         xaxis = list(title = "lambda", 
                      titlefont = list(size = 12)),
         yaxis2 = list(title = "RMSE", 
                       titlefont = list(size = 12)),
         xaxis2 = list(title = "lambda", 
                       titlefont = list(size = 12)),
         autosize = F, width = 900, height = 400)

# Clear workspace
rm(glmnet.alpha, glmnet.rmse, glmnet.lambda, glmnet.mae, p.glmnet.cv, train.index)
## glmnet 
## 
## 246 samples
##  33 predictor
## 
## Pre-processing: centered (33), scaled (33) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 221, 222, 221, 222, 222, 221, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda   RMSE       Rsquared   MAE      
##   0.0    0.00032  1.0024825  0.2653441  0.6913281
##   0.0    0.00800  1.0024825  0.2653441  0.6913281
##   0.0    0.20000  0.9898643  0.2445495  0.6370114
##   0.0    1.00000  1.0343083  0.1746313  0.6488788
##   0.1    0.00032  1.0272172  0.2559091  0.7255175
##   0.1    0.00800  1.0173454  0.2591756  0.7119920
##   0.1    0.20000  1.0054750  0.2281684  0.6327570
##   0.1    1.00000  1.0779697  0.1152146  0.6739505
##   0.6    0.00032  1.0276861  0.2555838  0.7261108
##   0.6    0.00800  1.0089328  0.2615413  0.6961165
##   0.6    0.20000  1.0846309  0.1464500  0.6735706
##   0.6    1.00000  1.0728707        NaN  0.6855987
##   1.0    0.00032  1.0279578  0.2554543  0.7261422
##   1.0    0.00800  1.0031054  0.2639103  0.6829836
##   1.0    0.20000  1.0795456  0.0510474  0.6794389
##   1.0    1.00000  1.0728707        NaN  0.6855987
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0 and lambda = 0.2.

 

kNN final model

# Print kNN model summary
ensemble.models$knn

# kNN cross-validation plot
knn.k <- ensemble.models$knn$results$k
knn.rmse <- ensemble.models$knn$results$RMSE
knn.mae <- ensemble.models$knn$results$MAE

# Plot the RMSE and MAE in a facet plot using facet_wrap
p.knn.cv <- data.frame(k=knn.k, RMSE=knn.rmse, MAE=knn.mae) %>%
  pivot_longer(cols=c(RMSE, MAE), names_to="Metric", values_to="Value") %>%
  mutate(Metric = ifelse(Metric=="MAE", "Mean Absolute Error", "Root-Mean Square Error")) %>%
  # One line per value of alpha
  ggplot(data=., mapping=aes(x=k, y=Value, color=Metric)) +
  # Facet the MAE and RMSE in separate plots
  facet_wrap(Metric ~ ., scales="free") +
  geom_point() +
  geom_line() +
  theme_minimal() +
  ggtitle("kNN Regression Cross-Validated Results") +
  theme(plot.title=element_text(hjust=0.5),
        axis.title=element_blank(),
        panel.spacing = unit(2, "lines"))

# Convert to interactive plotly
ggplotly(p.knn.cv) %>% 
  layout(yaxis = list(title = "MAE", 
                      titlefont = list(size = 12)),
         xaxis = list(title = "k", 
                      titlefont = list(size = 12)),
         yaxis2 = list(title = "RMSE", 
                       titlefont = list(size = 12)),
         xaxis2 = list(title = "k", 
                       titlefont = list(size = 12)),
         autosize = F, width = 900, height = 400)

rm(knn.rmse, knn.k, p.knn.cv, knn.mae)
## k-Nearest Neighbors 
## 
## 246 samples
##  33 predictor
## 
## Pre-processing: centered (33), scaled (33) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 221, 222, 221, 222, 222, 221, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared    MAE      
##    5  1.116258  0.05554699  0.6276505
##    7  1.091992  0.07131052  0.6108052
##    9  1.100593  0.07790373  0.6130387
##   11  1.095345  0.06069154  0.6068137
##   13  1.094842  0.05226503  0.6059138
##   15  1.090812  0.04770270  0.5984284
##   17  1.083195  0.03942269  0.5999945
##   19  1.074670  0.04885680  0.5935437
##   21  1.074273  0.05665069  0.5929702
##   23  1.077304  0.05419062  0.5951436
##   25  1.078761  0.04598779  0.5941601
##   27  1.077614  0.05117091  0.5940578
##   29  1.075366  0.05022101  0.5902141
##   31  1.077190  0.03886373  0.5887206
##   33  1.078476  0.04163973  0.5886809
##   35  1.078656  0.04222438  0.5902351
##   37  1.078481  0.04605131  0.5884794
##   39  1.078717  0.04757272  0.5901374
##   41  1.076745  0.04494799  0.5894965
##   43  1.077263  0.04513458  0.5898005
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 21.

 

Neural network final model

# Print neural network model summary
ensemble.models$nnet

# Neural network cross-validation plot
n.neurons <- ensemble.models$nnet$results$size
nnet.rmse <- ensemble.models$nnet$results$RMSE
nnet.mae <- ensemble.models$nnet$results$MAE
nnet.weight <- ensemble.models$nnet$results$decay

# Plot the RMSE and MAE in a facet plot using facet_wrap
p.nnet.cv <- data.frame(n.neurons, RMSE=nnet.rmse, MAE=nnet.mae, decay=nnet.weight) %>%
  mutate(decay=as.character(decay)) %>%
  pivot_longer(cols=c(RMSE, MAE), names_to="Metric", values_to="Value") %>%
  mutate(Metric = ifelse(Metric=="MAE", "Mean Absolute Error", "Root-Mean Square Error")) %>%
  # One line per value of alpha
  ggplot(data=., mapping=aes(x=n.neurons, y=Value, color=decay)) +
  # Facet the MAE and RMSE in separate plots
  facet_wrap(Metric ~ ., scales="free") +
  geom_point() +
  geom_line(aes(group=decay)) +
  theme_minimal() +
  ggtitle("Neural Network Regression Cross-Validated Results") +
  theme(plot.title=element_text(hjust=0.5),
        axis.title=element_blank(),
        panel.spacing = unit(2, "lines"))

# Convert to interactive plotly
ggplotly(p.nnet.cv) %>% 
  layout(yaxis = list(title = "MAE", 
                      titlefont = list(size = 12)),
         xaxis = list(title = "# Neurons in Hidden Layer", 
                      titlefont = list(size = 12)),
         yaxis2 = list(title = "RMSE", 
                       titlefont = list(size = 12)),
         xaxis2 = list(title = "# Neurons in Hidden Layer", 
                       titlefont = list(size = 12)),
         autosize = F, width = 900, height = 400)

rm(n.neurons, nnet.rmse, nnet.mae, nnet.weight, p.nnet.cv)
## Neural Network 
## 
## 246 samples
##  33 predictor
## 
## Pre-processing: centered (33), scaled (33) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 221, 222, 221, 222, 222, 221, ... 
## Resampling results across tuning parameters:
## 
##   size  decay  RMSE      Rsquared    MAE      
##    1    0.0    1.328625  0.13757164  0.7145074
##    1    0.2    1.264576  0.21623652  0.6936230
##    1    0.4    1.217789  0.30336524  0.6717363
##    1    0.6    1.191167  0.27985600  0.6736006
##    1    0.8    1.137362  0.30955177  0.6643404
##    1    1.0    1.098219  0.34422851  0.6551192
##    3    0.0    1.741657  0.20942663  1.0281973
##    3    0.2    1.323120  0.14957709  0.9070708
##    3    0.4    1.270399  0.10796698  0.8759850
##    3    0.6    1.154506  0.18820124  0.7570883
##    3    0.8    1.248311  0.18326235  0.8263540
##    3    1.0    1.080586  0.26200460  0.7156278
##    5    0.0    1.687248  0.12168509  1.1217697
##    5    0.2    1.463661  0.08668264  0.9990230
##    5    0.4    1.361751  0.12156121  0.9220889
##    5    0.6    1.211507  0.15780435  0.8068860
##    5    0.8    1.206544  0.13201464  0.8103969
##    5    1.0    1.237265  0.19792066  0.8266016
##   10    0.0    1.795388  0.02723955  1.3591350
##   10    0.2    1.320734  0.13099843  0.9296914
##   10    0.4    1.321488  0.11481149  0.9128680
##   10    0.6    1.282147  0.08057191  0.8739253
##   10    0.8    1.243587  0.11041300  0.8242700
##   10    1.0    1.231847  0.07735076  0.8450555
##   15    0.0    1.641867  0.04385296  1.2404813
##   15    0.2    1.428081  0.08025674  0.9245121
##   15    0.4    1.327971  0.06838231  0.8721033
##   15    0.6    1.250592  0.12015134  0.8584150
##   15    0.8    1.213121  0.13089912  0.8159638
##   15    1.0    1.165197  0.16543419  0.7901996
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 3 and decay = 1.

The optimal neural network includes three neurons in the hidden layer, each of which receives input from all 33 input nodes (31 ROIs, baseline age, and sex) and feeds into the final prediction node. Here’s a graphical representation of this caret-trained neural network:

par(mar = numeric(4))
plotnet(ensemble.models$nnet, cex_val=0.8, pad_x=0.6, pos_col="firebrick3", neg_col="dodgerblue4",
        circle_col="lightslategray", bord_col="lightslategray", alpha_val=0.4)
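As a rough sanity check on this architecture's complexity, the number of fitted weights for a 33-3-1 network can be tallied by hand (a back-of-the-envelope sketch, not caret output):

```r
# Each hidden neuron carries 33 input weights plus a bias term;
# the single output node carries 3 hidden-layer weights plus a bias
n.in <- 33; n.hidden <- 3
n.weights <- n.hidden * (n.in + 1) + (n.hidden + 1)
n.weights  # 106 parameters in total
```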

Random forest final model

# Print random forest model summary
ensemble.models$ranger

# Random forest cross-validation plot
splitrule <- ensemble.models$ranger$results$splitrule
numpred <- ensemble.models$ranger$results$mtry
rf.rmse <- ensemble.models$ranger$results$RMSE
rf.mae <- ensemble.models$ranger$results$MAE

# Plot the RMSE and MAE in a facet plot using facet_wrap
p.rf.cv <- data.frame(splitrule, RMSE=rf.rmse, MAE=rf.mae, numpred) %>%
  pivot_longer(cols=c(RMSE, MAE), names_to="Metric", values_to="Value") %>%
  mutate(Metric = ifelse(Metric=="MAE", "Mean Absolute Error", "Root-Mean Square Error")) %>%
  # One line per value of alpha
  ggplot(data=., mapping=aes(x=numpred, y=Value, color=splitrule)) +
  # Facet the MAE and RMSE in separate plots
  facet_wrap(Metric ~ ., scales="free") +
  geom_point() +
  geom_line(aes(group=splitrule)) +
  theme_minimal() +
  ggtitle("Random Forest Regression Cross-Validated Results") +
  theme(plot.title=element_text(hjust=0.5),
        axis.title=element_blank(),
        panel.spacing = unit(2, "lines"))

# Convert to interactive plotly
ggplotly(p.rf.cv) %>% 
  layout(yaxis = list(title = "MAE", 
                      titlefont = list(size = 12)),
         xaxis = list(title = "# Predictors in Decision Node", 
                      titlefont = list(size = 12)),
         yaxis2 = list(title = "RMSE", 
                       titlefont = list(size = 12)),
         xaxis2 = list(title = "# Predictors in Decision Node", 
                       titlefont = list(size = 12)),
         autosize = F, width = 900, height = 400)

rm(splitrule, numpred, rf.rmse, rf.mae, p.rf.cv)
## Random Forest 
## 
## 246 samples
##  33 predictor
## 
## Pre-processing: centered (33), scaled (33) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 221, 222, 221, 222, 222, 221, ... 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   RMSE      Rsquared    MAE      
##    2    variance    1.064625  0.12159510  0.6450826
##    2    extratrees  1.048859  0.09012903  0.6413026
##    4    variance    1.084291  0.12387114  0.6533298
##    4    extratrees  1.055776  0.08186302  0.6453438
##    6    variance    1.099507  0.11633353  0.6585372
##    6    extratrees  1.051749  0.10265665  0.6381747
##    8    variance    1.109254  0.12427881  0.6619890
##    8    extratrees  1.062632  0.09373095  0.6534412
##   10    variance    1.115296  0.12228989  0.6648207
##   10    extratrees  1.065214  0.10195274  0.6497414
##   13    variance    1.115753  0.11651169  0.6595782
##   13    extratrees  1.061037  0.10938914  0.6461101
##   15    variance    1.127241  0.12307379  0.6670222
##   15    extratrees  1.063785  0.10712628  0.6501017
##   17    variance    1.138345  0.12177977  0.6716269
##   17    extratrees  1.068892  0.09720528  0.6493183
##   19    variance    1.134277  0.11795628  0.6700897
##   19    extratrees  1.075894  0.10123054  0.6558778
##   21    variance    1.142939  0.12049751  0.6717537
##   21    extratrees  1.071633  0.10974243  0.6459720
##   24    variance    1.150516  0.12156153  0.6760727
##   24    extratrees  1.073411  0.11032160  0.6513125
##   26    variance    1.146694  0.13039576  0.6744959
##   26    extratrees  1.073445  0.10615220  0.6524639
##   28    variance    1.151409  0.11795867  0.6719930
##   28    extratrees  1.077274  0.11107250  0.6552030
##   30    variance    1.151527  0.12254689  0.6722165
##   30    extratrees  1.088744  0.11102570  0.6542741
##   33    variance    1.155188  0.12900397  0.6747987
##   33    extratrees  1.085570  0.10789043  0.6583809
## 
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 2, splitrule = extratrees
##  and min.node.size = 5.

Individual model performance

The cross-validation results from these four models can be collected and compared using the resamples function:

# Resample the performance of this ensemble and report summary metrics for MAE, RMSE, and R2
set.seed(127)
ensemble.results <- resamples(ensemble.models)
summary(ensemble.results)
## 
## Call:
## summary.resamples(object = ensemble.results)
## 
## Models: nnet, knn, ranger, glmnet 
## Number of resamples: 10 
## 
## MAE 
##             Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## nnet   0.5034794 0.5548646 0.7226288 0.7156278 0.8724837 0.8959477    0
## knn    0.3477478 0.4566578 0.6387360 0.5929702 0.6904719 0.9059611    0
## ranger 0.4051231 0.5121434 0.6982644 0.6413026 0.7523551 0.9095211    0
## glmnet 0.3758580 0.4969080 0.6371010 0.6370114 0.7929573 0.8404093    0
## 
## RMSE 
##             Min.   1st Qu.    Median      Mean  3rd Qu.     Max. NA's
## nnet   0.6660448 0.7485084 1.0835559 1.0805864 1.247121 1.640732    0
## knn    0.5047511 0.6937286 1.1151631 1.0742732 1.370107 1.912549    0
## ranger 0.5232129 0.6800173 1.0803837 1.0488592 1.330690 1.821540    0
## glmnet 0.4601160 0.6378514 0.9067867 0.9898643 1.336987 1.735698    0
## 
## Rsquared 
##                Min.     1st Qu.     Median       Mean    3rd Qu.      Max. NA's
## nnet   0.0400650563 0.131463965 0.33016028 0.26200460 0.34807695 0.4577081    0
## knn    0.0007697474 0.006319066 0.02821930 0.05665069 0.05807266 0.2418623    0
## ranger 0.0009920488 0.016495845 0.06184173 0.09012903 0.15514295 0.2557172    0
## glmnet 0.0277449563 0.103045558 0.19118049 0.24454950 0.30601572 0.6654200    0

Looking at the mean RMSE across resampling iterations, the elastic net (glmnet) has the lowest RMSE; by contrast, the neural network has the highest. The summary reports \(R^2\) values as well, though since most of these models are non-linear regressions (elastic net excepted), \(R^2\) is not a valid comparison metric in this instance.
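For reference, caret's reported Rsquared is (by default) the squared Pearson correlation between predictions and observations, and RMSE is the root of the mean squared residual; both are easy to reproduce in base R (toy vectors here, not ADNI values):

```r
# Toy predicted/observed vectors to illustrate the two metrics
pred <- c(1.2, 2.1, 2.9, 4.2)
obs  <- c(1, 2, 3, 4)
sqrt(mean((pred - obs)^2))  # RMSE, about 0.158
cor(pred, obs)^2            # Rsquared under caret's default "corr" formula
```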

It’s also useful to look at the prediction-error correlation between these component models:

cat("\nRoot-Mean Square Error Correlation\n")
# Calculate model RMSE correlations
modelCor(ensemble.results, metric="RMSE")
cat("\n\nMean Absolute Error Correlation\n")
# Calculate model MAE correlations
modelCor(ensemble.results, metric="MAE")
## 
## Root-Mean Square Error Correlation
##             nnet       knn    ranger    glmnet
## nnet   1.0000000 0.7691416 0.8088357 0.8862915
## knn    0.7691416 1.0000000 0.9941858 0.9496195
## ranger 0.8088357 0.9941858 1.0000000 0.9655972
## glmnet 0.8862915 0.9496195 0.9655972 1.0000000
## 
## 
## Mean Absolute Error Correlation
##             nnet       knn    ranger    glmnet
## nnet   1.0000000 0.7491175 0.7822654 0.8545223
## knn    0.7491175 1.0000000 0.9890882 0.8382916
## ranger 0.7822654 0.9890882 1.0000000 0.8519863
## glmnet 0.8545223 0.8382916 0.8519863 1.0000000

kNN and random forest show very high error correlation (r > 0.99). The other models also show relatively high error correlations; this is not ideal, since ensemble models are designed to counterbalance individual model error with uncorrelated predictions. The errors can be visualized with confidence intervals using the dotplot function:

# Plot the four models' RMSE values
p.rmse <- as.ggplot(dotplot(ensemble.results, metric="RMSE"))
# Plot the four models' MAE values
p.mae <- as.ggplot(dotplot(ensemble.results, metric="MAE")) 
# Combine plots side-by-side
grid.arrange(p.rmse, p.mae, ncol=2)

These four models show comparable RMSE confidence intervals. The elastic net regression model shows the lowest estimated RMSE, while the neural network shows the highest. kNN showed the lowest MAE, and the neural network again showed the highest. Despite the high error correlation among the models, I’ll move forward to see whether ensembling strengthens the overall predictions for the change in CDR-Sum of Boxes over time.
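To make concrete what high error correlation implies (a toy sketch with simulated fold errors, not the ADNI resamples): when two models' per-fold errors track each other, their correlation approaches 1 and averaging the models cancels little of the error:

```r
# Simulate per-fold RMSE values for two models whose errors move together
set.seed(1)
fold.rmse <- data.frame(knn = rnorm(10, mean = 1, sd = 0.3))
fold.rmse$ranger <- fold.rmse$knn + rnorm(10, sd = 0.05)  # nearly identical errors
cor(fold.rmse$knn, fold.rmse$ranger)  # close to 1, as modelCor reported above
```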

Model ensemble training and tuning

I will compare three different types of stacked ensembles:

  1. Default caretEnsemble, which combines the component models via a generalized linear model
  2. Stacked ensemble via caretStack, using a random forest to combine the component model predictions
  3. Stacked ensemble via caretStack, using an elastic net to combine the component model predictions

Generalized linear ensemble

Starting with the basic caretEnsemble function, which by default employs a generalized linear model to combine the component models:

set.seed(127)
# Set trainControl --> 10-fold CV with parallel processing
ensemble.control <- trainControl(method="repeatedcv", number=10, allowParallel = T)

# Create stacked ensemble, optimizing for RMSE
stacked.ensemble.glm <- caretEnsemble(ensemble.models, metric="RMSE", trControl=ensemble.control)
summary(stacked.ensemble.glm)
## The following models were ensembled: nnet, knn, ranger, glmnet 
## They were weighted: 
## -0.0287 0.2595 -0.271 0.4013 0.5987
## The resulting RMSE is: 1.0158
## The fit for each individual model on the RMSE is: 
##  method      RMSE    RMSESD
##    nnet 1.0805864 0.3570005
##     knn 1.0742732 0.4443390
##  ranger 1.0488592 0.4220274
##  glmnet 0.9898643 0.4280575

The elastic net model (glmnet) shows the lowest RMSE of the four models. The caretEnsemble package offers an autoplot function to create in-depth diagnostic plots for ensemble models:

autoplot(stacked.ensemble.glm)

The top left graph shows the mean cross-validated RMSE per model, with the bars denoting RMSE standard deviation. The elastic net model shows the lowest RMSE of the four, and its RMSE is even lower than that of the full ensemble model (indicated by the red dashed line). The middle left plot shows the relative weight given to each model in a linear combination; the elastic net has the highest weighting, followed by ranger and neural network; kNN actually has a negative weight.
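Conceptually, this glm-based stack is just a linear regression of the outcome on the component models' predictions; a minimal base-R sketch of the idea (toy data, with base1 and base2 standing in for base-model predictions):

```r
set.seed(1)
y <- rnorm(50)                     # toy outcome
base1 <- y + rnorm(50, sd = 0.5)   # stand-in for one base model's predictions
base2 <- y + rnorm(50, sd = 0.8)   # stand-in for a second base model
meta <- glm(y ~ base1 + base2)     # a default gaussian glm combiner
coef(meta)  # intercept plus one weight per component model
```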

Stacked ensembles

Stacked ensembles can also be constructed using caretStack, which trains a user-specified meta-model on the predictions of the constituent models. I’ll try one using a random forest combination of models and one using an elastic net combination of models.

set.seed(127)
# Create random forest-combined stacked ensemble
stacked.ensemble.rf <- caretStack(ensemble.models, method = "rf", metric = "RMSE", trControl = ensemble.control)
# Create elastic net-combined stacked ensemble
stacked.ensemble.glmnet <- caretStack(ensemble.models, method="glmnet", metric="RMSE", trControl = ensemble.control)

The summary statistics for each model can be displayed:

cat("\nStacked ensemble, generalized linear model:\n") 
stacked.ensemble.glm 
cat("\n\nStacked ensemble, random forest:\n")
stacked.ensemble.rf
cat("\n\nStacked ensemble, elastic net:\n")
stacked.ensemble.glmnet
## 
## Stacked ensemble, generalized linear model:
## A glm ensemble of 4 base models: nnet, knn, ranger, glmnet
## 
## Ensemble results:
## Generalized Linear Model 
## 
## 246 samples
##   4 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 222, 221, 222, 222, 221, 220, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE      
##   1.015842  0.1806129  0.6286504
## 
## 
## 
## Stacked ensemble, random forest:
## A rf ensemble of 4 base models: nnet, knn, ranger, glmnet
## 
## Ensemble results:
## Random Forest 
## 
## 246 samples
##   4 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 222, 221, 222, 222, 221, 220, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared   MAE      
##   2     1.088156  0.1663274  0.6343777
##   3     1.090313  0.1593837  0.6389241
##   4     1.105520  0.1455903  0.6431148
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 2.
## 
## 
## Stacked ensemble, elastic net:
## A glmnet ensemble of 4 base models: nnet, knn, ranger, glmnet
## 
## Ensemble results:
## glmnet 
## 
## 246 samples
##   4 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 222, 222, 222, 221, 222, 221, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda        RMSE      Rsquared   MAE      
##   0.10   0.0008093379  1.032366  0.1937303  0.6374587
##   0.10   0.0080933788  1.032313  0.1938340  0.6373587
##   0.10   0.0809337883  1.029287  0.1961329  0.6336675
##   0.55   0.0008093379  1.032572  0.1935667  0.6376566
##   0.55   0.0080933788  1.032505  0.1934793  0.6373892
##   0.55   0.0809337883  1.028034  0.1961576  0.6308639
##   1.00   0.0008093379  1.032600  0.1935506  0.6376821
##   1.00   0.0080933788  1.032372  0.1933438  0.6373242
##   1.00   0.0809337883  1.028801  0.1950727  0.6297854
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.55 and lambda = 0.08093379.

Each of these ensemble models shows comparable RMSE and MAE values. The RMSE is slightly lower for the glmnet-combined stacked ensemble model, though this difference may not be significant and may not translate to the out-of-sample data. These three ensemble models (glm-, rf-, and glmnet-combined) will be used to predict annual change in CDR-Sum of Boxes in the same test dataset. Note: since the models were trained with center and scale preprocessing specified, the test data does not need to be manually pre-processed.

Model predictions

Training data predictions

# Predict based on training data for the four individual models
# And the three stacked ensemble models
glmnet.train <- predict.train(ensemble.models$glmnet)
knn.train <- predict.train(ensemble.models$knn)
nnet.train <- predict.train(ensemble.models$nnet)
rf.train <- predict(ensemble.models$ranger)
ensemble.glm.train <- predict(stacked.ensemble.glm)
ensemble.glmnet.train <- predict(stacked.ensemble.glmnet)
ensemble.rf.train <- predict(stacked.ensemble.rf)
real.train <- original.train$CDRSB

# Combine these predictions into a dataframe for easier viewing
train.df <- do.call(cbind, list(elastic.net=glmnet.train, knn=knn.train, neural.net=nnet.train, random.forest=rf.train,
                                ensemble.glm=ensemble.glm.train, ensemble.glmnet=ensemble.glmnet.train, 
                                ensemble.rf=ensemble.rf.train, real.CDR=real.train)) %>% as.data.frame()

datatable(train.df %>% mutate_if(is.numeric, function(x) round(x,4)) %>% select(real.CDR, elastic.net:ensemble.rf))

Test data predictions

# Predict based on UNSEEN test data for the four individual models
# And the three stacked ensemble models
real.test <- original.test$CDRSB  
glmnet.test <- predict.train(ensemble.models$glmnet, newdata=original.test)
knn.test <- predict.train(ensemble.models$knn, newdata=original.test)
nnet.test <- predict.train(ensemble.models$nnet, newdata=original.test)
rf.test <- predict.train(ensemble.models$ranger, newdata=original.test)
ensemble.glm.test <- predict(stacked.ensemble.glm, newdata=original.test)
ensemble.glmnet.test <- predict(stacked.ensemble.glmnet, newdata=original.test)
ensemble.rf.test <- predict(stacked.ensemble.rf, newdata=original.test)

# Combine these predictions into a dataframe for easier viewing
test.df <- do.call(cbind, list(real.CDR=real.test, elastic.net=glmnet.test, knn=knn.test, neural.net=nnet.test, random.forest=rf.test,
                               ensemble.glm=ensemble.glm.test, ensemble.glmnet=ensemble.glmnet.test, ensemble.rf=ensemble.rf.test)) %>% as.data.frame()

datatable(test.df %>% mutate_if(is.numeric, function(x) round(x,4)))

Predicted vs. real CDR-SoB comparison

I want to compare these three ensemble models in terms of how their predictions relate to the actual CDR-SoB values in the training and testing data. I also want to compare these results with those obtained from the individual component models to see if constructing the ensemble confers a predictive advantage. First, I will visualize how the predicted values stack up against the actual value for annual change in CDR-Sum of Boxes, in both the training and the test data:

# Create training data ggplot to be converted to interactive plotly plot
p.train <- train.df %>%
  # Reshape to facet on model
  pivot_longer(cols=c(-real.CDR), names_to="Model", values_to="Prediction") %>%
  mutate(Model=factor(Model, levels=c("ensemble.glmnet", "ensemble.glm", "ensemble.rf", "elastic.net", "knn", "neural.net", "random.forest"))) %>%
  ggplot(data=., mapping=aes(x=real.CDR, y= Prediction, color=Model)) +
  geom_point(alpha=0.3) +
  facet_grid(.~Model, scales="free") +
  ggtitle("Model Predictions for CDR Sum of Boxes Annual Change in Training Data") +
  theme_minimal() +
  theme(legend.position="none") +
  ylab("Predicted CDR-SoB Change") +
  xlab("Actual CDR-SoB Change") +
  theme(plot.title=element_text(hjust=0.5))

# Create test data ggplot to be converted to interactive plotly plot
p.test <- test.df  %>%
  # Reshape to facet on model
  pivot_longer(cols=c(-real.CDR), names_to="Model", values_to="Prediction") %>%
  mutate(Model=factor(Model, levels=c("ensemble.glmnet", "ensemble.glm", "ensemble.rf", "elastic.net", "knn", "neural.net", "random.forest"))) %>%
  ggplot(data=., mapping=aes(x=real.CDR, y= Prediction, color=Model)) +
  geom_point(alpha=0.3) +
  facet_grid(.~Model, scales="free") +
  ggtitle("Model Predictions for CDR Sum of Boxes Annual Change in Test Data") +
  theme_minimal() +
  theme(legend.position="none") +
  ylab("Predicted CDR-SoB Change") +
  xlab("Actual CDR-SoB Change") +
  theme(plot.title=element_text(hjust=0.5))

# Use ggplotly to create interactive HTML plots
ggplotly(p.train, height=350, width=900)
ggplotly(p.test, height=350, width=900) 

To quantify the association between real and model-predicted CDR-SoB values, I will compute \(R^2\) and RMSE via the R2 and RMSE functions from caret:

# Calculate the RMSE between real vs. predicted CDR-SoB values for training data
# Combine into dataframe for easier viewing
rmse.train <- data.frame(ensemble.glmnet=RMSE(ensemble.glmnet.train, real.train),
                         ensemble.glm=RMSE(ensemble.glm.train, real.train),
                         ensemble.rf=RMSE(ensemble.rf.train, real.train),
                         elastic.net=RMSE(glmnet.train, real.train),
                         knn=RMSE(knn.train, real.train),
                         neural.net=RMSE(nnet.train, real.train),
                         random.forest=RMSE(rf.train, real.train),
                         Metric="Train_RMSE")
# Calculate the RMSE between real vs. predicted CDR-SoB values for unseen test data
# Combine into dataframe for easier viewing
rmse.test <- data.frame(ensemble.glmnet=RMSE(ensemble.glmnet.test, real.test),
                        ensemble.glm=RMSE(ensemble.glm.test, real.test),
                        ensemble.rf=RMSE(ensemble.rf.test, real.test),
                        elastic.net=RMSE(glmnet.test, real.test),
                        knn=RMSE(knn.test, real.test),
                        neural.net=RMSE(nnet.test, real.test),
                        random.forest=RMSE(rf.test, real.test),
                        Metric="Test_RMSE")
# Calculate the R-squared between real vs. predicted CDR-SoB values for training data
# Combine into dataframe for easier viewing
r2.train <- data.frame(ensemble.glmnet=R2(ensemble.glmnet.train, real.train),
                       ensemble.glm=R2(ensemble.glm.train, real.train),
                       ensemble.rf=R2(ensemble.rf.train, real.train),
                       elastic.net=R2(glmnet.train, real.train),
                       knn=R2(knn.train, real.train),
                       neural.net=R2(nnet.train, real.train),
                       random.forest=R2(rf.train, real.train),
                       Metric="Train_R2")
# Calculate the R-squared between real vs. predicted CDR-SoB values for unseen test data
# Combine into dataframe for easier viewing
r2.test <- data.frame(ensemble.glmnet=R2(ensemble.glmnet.test, real.test),
                      ensemble.glm=R2(ensemble.glm.test, real.test),
                      ensemble.rf=R2(ensemble.rf.test, real.test),
                      elastic.net=R2(glmnet.test, real.test),
                      knn=R2(knn.test, real.test),
                      neural.net=R2(nnet.test, real.test),
                      random.forest=R2(rf.test, real.test),
                      Metric="Test_R2")
# Combine all four prediction dataframes into one table to compare and contrast
# RMSE and R-squared across models
do.call(plyr::rbind.fill, list(rmse.train, rmse.test, r2.train, r2.test)) %>%
  pivot_longer(cols=c(-Metric), names_to="Model", values_to="Value") %>%
  mutate(Metric = str_replace(Metric, "_", " ")) %>%
  pivot_wider(id_cols="Model", names_from="Metric", values_from="Value") %>%
  mutate_if(is.numeric, function(x) round(x,4)) %>%
  kable(., booktabs=T) %>% kable_styling(full_width=F)
Model Train RMSE Test RMSE Train R2 Test R2
ensemble.glmnet 0.8331 0.8962 0.5983 0.0367
ensemble.glm 0.6904 0.8975 0.7145 0.0496
ensemble.rf 0.9600 1.0417 0.3158 0.0203
elastic.net 0.9191 0.8924 0.3811 0.0495
knn 1.1073 0.9039 0.1119 0.0482
neural.net 0.6621 1.1349 0.6813 0.0051
random.forest 0.6490 0.8361 0.9375 0.1582

The predicted CDR-SoB change values from the individual random forest model were very similar to the actual observed values, yielding an \(R^2\) of 0.94. However, comparing this with the \(R^2\) of 0.16 in the test dataset suggests that the model is overfit to the training data and does not generalize well beyond it. The same can be said of the other models, all of which exhibited substantially stronger agreement between predictions and actual values in the training dataset than in the out-of-sample test dataset.
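This train/test gap can be reproduced in miniature: a flexible model fit to pure noise attains a high training R² that would collapse out of sample (a toy sketch, unrelated to the ADNI data):

```r
set.seed(127)
n <- 40
x <- matrix(rnorm(n * 20), n, 20)  # 20 predictors of pure noise
y <- rnorm(n)                      # outcome unrelated to the predictors
fit <- lm(y ~ x)
summary(fit)$r.squared             # substantial in-sample R2 despite no signal
```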

# Compile RMSE and R2 results comparing real vs. predicted values for ensembles and component models
overall.ensemble.results <- do.call(plyr::rbind.fill, list(rmse.train, rmse.test, r2.train, r2.test)) %>%
  # Reshape to facet on metric -- i.e. RMSE or R2
  pivot_longer(cols=c(-Metric), names_to="Model", values_to="Value") %>%
  separate(Metric, into=c("Data", "Metric"), sep="_")

p.ensemble.r2.rmse <- overall.ensemble.results %>%
  mutate(Metric = ifelse(Metric=="RMSE", "Real vs. Predicted CDR-SoB RMSE", "Real vs. Predicted CDR-SoB R2")) %>%
  mutate(Data=factor(Data, levels=c("Train", "Test")),
         Model=factor(Model, levels=c("ensemble.glmnet", "ensemble.glm", "ensemble.rf", "elastic.net", "knn", "neural.net", "random.forest"))) %>%
  ggplot(data=., mapping=aes(x=Data, y=Value, color=Model, group=Model)) +
  geom_point() +
  geom_line() +
  theme_minimal() +
  facet_wrap(Metric~., scales="free", nrow=1) +
  theme(strip.text=element_text(size=12, face="bold"),
        axis.title=element_blank())

# Convert to interactive plotly plot, rename x/y axis titles
ggplotly(p.ensemble.r2.rmse) %>% 
  layout(yaxis = list(title = "R2", 
                      titlefont = list(size = 12)),
         xaxis = list(title = "Data Subset", 
                      titlefont = list(size = 12)),
         yaxis2 = list(title = "RMSE", 
                       titlefont = list(size = 12)),
         xaxis2 = list(title = "Data Subset", 
                       titlefont = list(size = 12)),
         autosize = F, width = 900, height = 400)

Rename objects to names specific to the original dataset:

# Rename relevant objects to have original data-specific names
models.for.ensemble.OG <- ensemble.models
model.metrics.OG <- overall.ensemble.results
stacked.ensemble.glm.OG <- stacked.ensemble.glm
stacked.ensemble.glmnet.OG <- stacked.ensemble.glmnet
stacked.ensemble.rf.OG <- stacked.ensemble.rf

Phase Four: Modeling (PCA Data)

Data split for train + test

As per standard convention in model development, I will randomly partition this dataset into training and testing subsets, such that the models are trained and evaluated in independent data subsets. I will use the sample function with a specific seed set (127) to partition the data into 75% training and 25% testing.

# Set seed for reproducibility of the random train/test split
set.seed(127)
train.index <- sample(nrow(post.pca), nrow(post.pca)*0.75, replace=F)

# Remove unnecessary identifier info from datasets for modeling
pca.df <- post.pca %>% ungroup() %>% select(-RID, -interval_num)

# Pre-processing will be applied in model training with caret

# Subset training + test data for PCA-transformed data
pca.train <- pca.df[train.index, ]
pca.test <- pca.df[-train.index, ]

Individual model training and tuning

I will be using the caretEnsemble package to compile four individual regression models into a stacked ensemble. This package enables evaluation of the models individually as well as together in the ensemble.

The first step is to create a caretList of the four regression models I will use:

  • Elastic net regression (glmnet)
  • k-nearest neighbors regression (knn)
  • Neural network regression (nnet)
  • Random forest regression (ranger)

I will use the trainControl function to specify ten-fold cross-validation with parallel processing.

# trainControl for 10-fold CV and parallel processing
ensemble.control <- trainControl(method="cv", number=10, allowParallel=T)

I will use RMSE as the metric for tuning each model’s parameters. All input data (training and test alike) will be automatically standardized using caret’s preProcess center and scale options.

# Set seed for consistency
set.seed(127)

# Neural net -- linout=T means linear output (i.e. not constrained to be [0,1]),
# try a range of hidden layer sizes and weight decays
my.neuralnet <- caretModelSpec(method="nnet", linout=T, trace=F, tuneGrid = expand.grid(size=c(1, 3, 5, 10, 15), decay = seq(0, 1, by=0.2)))

# k-nearest neighbors: try 20 different values of k
my.knn <- caretModelSpec(method="knn", tuneLength=20)

# random forest: try 15 different numbers of features considered at each node and grow 500 trees
my.randomforest <- caretModelSpec(method="ranger", tuneLength=15, num.trees=500, importance="permutation")

# elastic net: try four different values of alpha for ridge/lasso blending and four lambda values for coefficient penalty
my.elasticnet <- caretModelSpec(method="glmnet",tuneGrid=expand.grid(alpha=c(0,0.1,0.6,1), lambda=c(5^-5,5^-3,5^-1,1)))
# Compile individual models into one cohesive model list using caretList
invisible(capture.output(ensemble.models <- caretList(CDRSB ~ ., 
                                                      data=pca.train, 
                                                      trControl=ensemble.control, 
                                                      metric="RMSE", 
                                                      preProcess=c("center", "scale"),
                                                      tuneList=list(my.neuralnet, my.knn, my.randomforest, my.elasticnet))))

The final chosen parameters for each model can be viewed:

Elastic net final model

# Print elastic net model summary
ensemble.models$glmnet

# Elastic net cross-validation results
glmnet.alpha <- ensemble.models$glmnet$results$alpha
glmnet.rmse <- ensemble.models$glmnet$results$RMSE
glmnet.mae <- ensemble.models$glmnet$results$MAE
glmnet.lambda <- ensemble.models$glmnet$results$lambda

# Plot the RMSE and MAE in a facet plot using facet_wrap
p.glmnet.cv <- data.frame(alpha=glmnet.alpha, RMSE=glmnet.rmse, MAE=glmnet.mae, lambda=glmnet.lambda) %>%
  mutate(alpha=as.character(alpha)) %>%
  pivot_longer(cols=c(RMSE, MAE), names_to="Metric", values_to="Value") %>%
  mutate(Metric = ifelse(Metric=="MAE", "Mean Absolute Error", "Root-Mean Square Error")) %>%
  # One line per value of alpha
  ggplot(data=., mapping=aes(x=lambda, y=Value, color=alpha)) +
  # Facet the MAE and RMSE in separate plots
  facet_wrap(Metric ~ ., scales="free") +
  geom_point() +
  geom_line(aes(group=alpha)) +
  theme_minimal() +
  ggtitle("Elastic Net Regression Cross-Validated Results") +
  theme(plot.title=element_text(hjust=0.5),
        axis.title=element_blank(),
        panel.spacing = unit(2, "lines"))

# Convert to interactive plotly
ggplotly(p.glmnet.cv) %>% 
  layout(yaxis = list(title = "MAE", 
                      titlefont = list(size = 12)),
         xaxis = list(title = "lambda", 
                      titlefont = list(size = 12)),
         yaxis2 = list(title = "RMSE", 
                       titlefont = list(size = 12)),
         xaxis2 = list(title = "lambda", 
                       titlefont = list(size = 12)),
         autosize = F, width = 900, height = 400)

# Clear workspace
rm(glmnet.alpha, glmnet.rmse, glmnet.lambda, glmnet.mae, p.glmnet.cv, train.index)
## glmnet 
## 
## 246 samples
##   7 predictor
## 
## Pre-processing: centered (7), scaled (7) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 221, 222, 221, 222, 222, 221, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda   RMSE      Rsquared    MAE      
##   0.0    0.00032  1.120429  0.09655352  0.7065869
##   0.0    0.00800  1.120429  0.09655352  0.7065869
##   0.0    0.20000  1.111204  0.09264358  0.6992882
##   0.0    1.00000  1.092248  0.08877960  0.6888541
##   0.1    0.00032  1.121643  0.09689871  0.7074884
##   0.1    0.00800  1.121216  0.09676384  0.7070698
##   0.1    0.20000  1.103896  0.10168777  0.6900957
##   0.1    1.00000  1.078150  0.05497418  0.6799398
##   0.6    0.00032  1.121655  0.09694159  0.7074408
##   0.6    0.00800  1.119002  0.09673422  0.7044619
##   0.6    0.20000  1.080718  0.05926795  0.6763468
##   0.6    1.00000  1.072871         NaN  0.6855987
##   1.0    0.00032  1.121677  0.09694846  0.7074455
##   1.0    0.00800  1.117345  0.09722338  0.7024775
##   1.0    0.20000  1.075379  0.06960295  0.6799288
##   1.0    1.00000  1.072871         NaN  0.6855987
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0.6 and lambda = 1.

Based on RMSE, lambda=1 was chosen – this is the largest penalty value of the four assessed lambda values. Additionally, alpha=0.6 was chosen, meaning the model is a blend of ridge and lasso regression, leaning slightly toward lasso regression (which uses L1 regularization).
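To make the roles of alpha and lambda concrete, here is a minimal base-R sketch of the penalty that glmnet adds to the residual sum of squares (the coefficient vector is hypothetical, for illustration only):

```r
# Elastic net penalty: lambda * [ (1 - alpha)/2 * ||beta||_2^2 + alpha * ||beta||_1 ]
# alpha = 0 is pure ridge, alpha = 1 is pure lasso; beta below is hypothetical.
enet_penalty <- function(beta, alpha, lambda) {
  lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
}
beta <- c(0.5, -0.3, 0.2)
enet_penalty(beta, alpha = 0,   lambda = 1)  # pure ridge:   0.19
enet_penalty(beta, alpha = 1,   lambda = 1)  # pure lasso:   1.00
enet_penalty(beta, alpha = 0.6, lambda = 1)  # chosen blend: 0.676
```

Larger lambda shrinks coefficients more aggressively; the alpha = 0.6 blend weights the L1 term more heavily than the L2 term, consistent with the model leaning toward lasso.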

 

kNN final model

# Print kNN model summary
ensemble.models$knn

# kNN cross-validation plot
knn.k <- ensemble.models$knn$results$k
knn.rmse <- ensemble.models$knn$results$RMSE
knn.mae <- ensemble.models$knn$results$MAE

# Plot the RMSE and MAE in a facet plot using facet_wrap
p.knn.cv <- data.frame(k=knn.k, RMSE=knn.rmse, MAE=knn.mae) %>%
  pivot_longer(cols=c(RMSE, MAE), names_to="Metric", values_to="Value") %>%
  mutate(Metric = ifelse(Metric=="MAE", "Mean Absolute Error", "Root-Mean Square Error")) %>%
  # Color by metric (RMSE vs. MAE)
  ggplot(data=., mapping=aes(x=k, y=Value, color=Metric)) +
  # Facet the MAE and RMSE in separate plots
  facet_wrap(Metric ~ ., scales="free") +
  geom_point() +
  geom_line() +
  theme_minimal() +
  ggtitle("kNN Regression Cross-Validated Results") +
  theme(plot.title=element_text(hjust=0.5),
        axis.title=element_blank(),
        panel.spacing = unit(2, "lines"))

# Convert to interactive plotly
ggplotly(p.knn.cv) %>% 
  layout(yaxis = list(title = "MAE", 
                      titlefont = list(size = 12)),
         xaxis = list(title = "k", 
                      titlefont = list(size = 12)),
         yaxis2 = list(title = "RMSE", 
                       titlefont = list(size = 12)),
         xaxis2 = list(title = "k", 
                       titlefont = list(size = 12)),
         autosize = F, width = 900, height = 400)

rm(knn.rmse, knn.k, p.knn.cv, knn.mae)
## k-Nearest Neighbors 
## 
## 246 samples
##   7 predictor
## 
## Pre-processing: centered (7), scaled (7) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 221, 222, 221, 222, 222, 221, ... 
## Resampling results across tuning parameters:
## 
##   k   RMSE      Rsquared    MAE      
##    5  1.095632  0.08289345  0.6389863
##    7  1.099697  0.05080179  0.6478129
##    9  1.095010  0.03837144  0.6484233
##   11  1.084744  0.03278807  0.6288731
##   13  1.081373  0.03914047  0.6244824
##   15  1.073794  0.05760648  0.6188573
##   17  1.070625  0.06005288  0.6145598
##   19  1.077580  0.04757898  0.6178363
##   21  1.078646  0.04796042  0.6193779
##   23  1.080235  0.04242322  0.6200468
##   25  1.079717  0.04149783  0.6173809
##   27  1.072803  0.05326838  0.6150709
##   29  1.071514  0.05248265  0.6134723
##   31  1.071258  0.05519706  0.6133139
##   33  1.074527  0.04573226  0.6114621
##   35  1.072335  0.05369971  0.6078421
##   37  1.072503  0.05093173  0.6069605
##   39  1.072482  0.05254741  0.6071913
##   41  1.072883  0.05328870  0.6059066
##   43  1.073248  0.04967071  0.6059558
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was k = 17.

k=17 was chosen to minimize the RMSE. This value is also a local minimum for the MAE, though the MAE continues to decrease slightly at larger values of k.
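As a reminder of the mechanic being tuned here, kNN regression predicts a query point as the mean outcome of its k nearest training points. A minimal base-R sketch on toy data (not the ADNI features):

```r
# Toy kNN regression: predict as the mean outcome of the k nearest neighbors
# (Euclidean distance on standardized features, as caret's preProcess ensures).
knn_predict <- function(train_x, train_y, query, k) {
  d <- sqrt(rowSums(sweep(train_x, 2, query)^2))  # distance to each training row
  mean(train_y[order(d)[1:k]])                    # average the k closest outcomes
}
train_x <- matrix(c(0, 1, 2, 10), ncol = 1)
train_y <- c(0.1, 0.2, 0.3, 5.0)
knn_predict(train_x, train_y, query = 0.5, k = 2)  # mean(0.1, 0.2) = 0.15
```

Larger k averages over more neighbors, smoothing predictions at the cost of local detail, which is the bias-variance trade-off the cross-validation above is navigating.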

 

Neural network final model

# Print neural network model summary
ensemble.models$nnet

# Neural network cross-validation plot
n.neurons <- ensemble.models$nnet$results$size
nnet.rmse <- ensemble.models$nnet$results$RMSE
nnet.mae <- ensemble.models$nnet$results$MAE
nnet.weight <- ensemble.models$nnet$results$decay

# Plot the RMSE and MAE in a facet plot using facet_wrap
p.nnet.cv <- data.frame(n.neurons, RMSE=nnet.rmse, MAE=nnet.mae, decay=nnet.weight) %>%
  mutate(decay=as.character(decay)) %>%
  pivot_longer(cols=c(RMSE, MAE), names_to="Metric", values_to="Value") %>%
  mutate(Metric = ifelse(Metric=="MAE", "Mean Absolute Error", "Root-Mean Square Error")) %>%
  # One line per value of decay
  ggplot(data=., mapping=aes(x=n.neurons, y=Value, color=decay)) +
  # Facet the MAE and RMSE in separate plots
  facet_wrap(Metric ~ ., scales="free") +
  geom_point() +
  geom_line(aes(group=decay)) +
  theme_minimal() +
  ggtitle("Neural Network Regression Cross-Validated Results") +
  theme(plot.title=element_text(hjust=0.5),
        axis.title=element_blank(),
        panel.spacing = unit(2, "lines"))

# Convert to interactive plotly
ggplotly(p.nnet.cv) %>% 
  layout(yaxis = list(title = "MAE", 
                      titlefont = list(size = 12)),
         xaxis = list(title = "# Neurons in Hidden Layer", 
                      titlefont = list(size = 12)),
         yaxis2 = list(title = "RMSE", 
                       titlefont = list(size = 12)),
         xaxis2 = list(title = "# Neurons in Hidden Layer", 
                       titlefont = list(size = 12)),
         autosize = F, width = 900, height = 400)

rm(n.neurons, nnet.rmse, nnet.mae, nnet.weight, p.nnet.cv)
## Neural Network 
## 
## 246 samples
##   7 predictor
## 
## Pre-processing: centered (7), scaled (7) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 221, 222, 221, 222, 222, 221, ... 
## Resampling results across tuning parameters:
## 
##   size  decay  RMSE      Rsquared    MAE      
##    1    0.0    1.245644  0.09270336  0.7320901
##    1    0.2    1.192384  0.05973576  0.7215185
##    1    0.4    1.167875  0.06849000  0.6952365
##    1    0.6    1.166977  0.09277111  0.7182222
##    1    0.8    1.138191  0.07626984  0.7016022
##    1    1.0    1.147937  0.11282020  0.7182524
##    3    0.0    1.279886  0.08310500  0.7504749
##    3    0.2    1.245645  0.07589623  0.7871604
##    3    0.4    1.199659  0.08438305  0.7179262
##    3    0.6    1.153764  0.08378698  0.7050315
##    3    0.8    1.151742  0.05853814  0.6968032
##    3    1.0    1.139317  0.04608047  0.6920970
##    5    0.0    2.852068  0.10901075  1.2135027
##    5    0.2    1.186223  0.07133170  0.7566989
##    5    0.4    1.299139  0.05493359  0.8220907
##    5    0.6    1.240741  0.03175004  0.7999623
##    5    0.8    1.150480  0.03837154  0.7027839
##    5    1.0    1.152905  0.04999714  0.7128881
##   10    0.0    1.617319  0.04562712  1.1305324
##   10    0.2    1.312021  0.05878781  0.8568002
##   10    0.4    1.283676  0.06840569  0.8449359
##   10    0.6    1.172149  0.05627968  0.7334233
##   10    0.8    1.163288  0.05143240  0.7458729
##   10    1.0    1.186060  0.04664239  0.7251889
##   15    0.0    1.828227  0.10215527  1.2971160
##   15    0.2    1.440809  0.05869640  0.9635911
##   15    0.4    1.286711  0.03298476  0.8384455
##   15    0.6    1.241962  0.06854765  0.7906647
##   15    0.8    1.184070  0.06865887  0.7409692
##   15    1.0    1.159711  0.02835005  0.7211840
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 1 and decay = 0.8.

The optimal neural network based on RMSE includes one neuron in the hidden layer, which receives input from all seven input nodes (the five principal components, baseline age, and sex) and outputs onto the final prediction node. The decay weight was chosen as 0.8. Here’s a graphical representation of this caret-trained neural network:

par(mar = numeric(4))
plotnet(ensemble.models$nnet, cex_val=0.8, pad_x=0.6, pos_col="firebrick3", neg_col="dodgerblue4",
        circle_col="lightslategray", bord_col="lightslategray", alpha_val=0.4)

Age at baseline, PC1, PC5, and sex (male) all exert negative weights onto the hidden layer, while PC2, PC3, and PC4 exert positive weights.
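With one hidden neuron and a linear output (linout=T), the fitted network reduces to a single squashed linear combination of the inputs. A sketch with hypothetical weights (nnet uses a logistic sigmoid for the hidden unit):

```r
# One-hidden-neuron network with linear output, as nnet fits with linout=TRUE:
# prediction = b2 + w2 * sigmoid(b1 + sum(w1 * x)). All weights are hypothetical.
sigmoid <- function(z) 1 / (1 + exp(-z))
predict_1neuron <- function(x, w1, b1, w2, b2) {
  hidden <- sigmoid(b1 + sum(w1 * x))  # single hidden-unit activation
  b2 + w2 * hidden                     # linear (unbounded) output
}
predict_1neuron(x = c(0.2, -0.1), w1 = c(1.5, -0.7), b1 = 0.1, w2 = 2, b2 = -0.5)
```

The sign of each w1 element corresponds to the positive (red) and negative (blue) input weights shown in the plotnet diagram.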

Random forest final model

# Print random forest model summary
ensemble.models$ranger

# Random forest cross-validation plot
splitrule <- ensemble.models$ranger$results$splitrule
numpred <- ensemble.models$ranger$results$mtry
rf.rmse <- ensemble.models$ranger$results$RMSE
rf.mae <- ensemble.models$ranger$results$MAE

# Plot the RMSE and MAE in a facet plot using facet_wrap
p.rf.cv <- data.frame(splitrule, RMSE=rf.rmse, MAE=rf.mae, numpred) %>%
  pivot_longer(cols=c(RMSE, MAE), names_to="Metric", values_to="Value") %>%
  mutate(Metric = ifelse(Metric=="MAE", "Mean Absolute Error", "Root-Mean Square Error")) %>%
  # One line per split rule
  ggplot(data=., mapping=aes(x=numpred, y=Value, color=splitrule)) +
  # Facet the MAE and RMSE in separate plots
  facet_wrap(Metric ~ ., scales="free") +
  geom_point() +
  geom_line(aes(group=splitrule)) +
  theme_minimal() +
  ggtitle("Random Forest Regression Cross-Validated Results") +
  theme(plot.title=element_text(hjust=0.5),
        axis.title=element_blank(),
        panel.spacing = unit(2, "lines"))

# Convert to interactive plotly
ggplotly(p.rf.cv) %>% 
  layout(yaxis = list(title = "MAE", 
                      titlefont = list(size = 12)),
         xaxis = list(title = "# Predictors in Decision Node", 
                      titlefont = list(size = 12)),
         yaxis2 = list(title = "RMSE", 
                       titlefont = list(size = 12)),
         xaxis2 = list(title = "# Predictors in Decision Node", 
                       titlefont = list(size = 12)),
         autosize = F, width = 900, height = 400)

rm(splitrule, numpred, rf.rmse, rf.mae, p.rf.cv)
## Random Forest 
## 
## 246 samples
##   7 predictor
## 
## Pre-processing: centered (7), scaled (7) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 221, 222, 221, 222, 222, 221, ... 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   RMSE      Rsquared    MAE      
##   2     variance    1.087704  0.09159020  0.6645584
##   2     extratrees  1.076396  0.04057227  0.6589970
##   3     variance    1.097318  0.09194553  0.6706066
##   3     extratrees  1.075592  0.05217097  0.6561724
##   4     variance    1.105181  0.09240744  0.6759372
##   4     extratrees  1.079706  0.05795037  0.6574317
##   5     variance    1.109070  0.09926728  0.6742750
##   5     extratrees  1.084472  0.05483920  0.6599709
##   6     variance    1.106586  0.09518613  0.6773890
##   6     extratrees  1.089994  0.05894553  0.6637487
##   7     variance    1.110481  0.09341193  0.6773223
##   7     extratrees  1.094493  0.05715666  0.6636269
## 
## Tuning parameter 'min.node.size' was held constant at a value of 5
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 3, splitrule = extratrees
##  and min.node.size = 5.

An mtry of 3 with an extratrees split rule yielded the lowest RMSE, meaning three predictor terms are considered at each decision node. This combination also yielded the lowest MAE value.
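To illustrate what mtry=3 means mechanically, each decision node considers only a random subset of three of the seven predictors when searching for its best split (the predictor names below are illustrative stand-ins for this dataset’s features):

```r
# With mtry = 3 of 7 predictors, each predictor is a split candidate at a node
# with probability 3/7 (~0.43); simulate many nodes to confirm.
set.seed(127)
predictors <- c("PC1", "PC2", "PC3", "PC4", "PC5", "baseline.age", "sex")
draws <- replicate(1000, sample(predictors, 3))     # 3 candidates per simulated node
mean(apply(draws == "PC1", 2, any))                 # ~ 3/7
```

This per-node subsetting decorrelates the individual trees, which is what gives the forest its variance reduction over a single tree.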

Individual model performance

These four models can be resampled and aggregated using the resamples function:

# Resample the performance of this ensemble and report summary metrics for MAE, RMSE, and R2
set.seed(127)
ensemble.results <- resamples(ensemble.models)
summary(ensemble.results)
## 
## Call:
## summary.resamples(object = ensemble.results)
## 
## Models: nnet, knn, ranger, glmnet 
## Number of resamples: 10 
## 
## MAE 
##             Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## nnet   0.4206481 0.5442455 0.7859245 0.7016022 0.8214511 0.9074631    0
## knn    0.3460927 0.4526976 0.6951723 0.6145598 0.7222040 0.8729141    0
## ranger 0.4110419 0.5181849 0.6957156 0.6561724 0.7555823 0.9238009    0
## glmnet 0.4474888 0.5692098 0.7479911 0.6855987 0.7901979 0.9026879    0
## 
## RMSE 
##             Min.   1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## nnet   0.5935476 0.7122655 1.244294 1.138191 1.393087 1.877673    0
## knn    0.4889936 0.6738322 1.140509 1.070625 1.387874 1.874883    0
## ranger 0.6181761 0.7294735 1.101456 1.075592 1.301085 1.863276    0
## glmnet 0.5007390 0.7217365 1.101043 1.072871 1.375093 1.818658    0
## 
## Rsquared 
##               Min.    1st Qu.     Median       Mean    3rd Qu.      Max. NA's
## nnet   0.001466325 0.03338898 0.06436858 0.07626984 0.08118771 0.2525644    0
## knn    0.004153727 0.01810022 0.03311722 0.06005288 0.07510692 0.2381806    0
## ranger 0.003690445 0.01414694 0.03767737 0.05217097 0.08695230 0.1530365    0
## glmnet          NA         NA         NA        NaN         NA        NA   10

The NA values for the glmnet model \(R^2\) indicate that the model outputs the same prediction for every case, likely because coefficient shrinkage has reduced the model to the intercept term alone. The kNN model has the lowest mean MAE and RMSE of all the models.
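The NA behavior is easy to reproduce: caret’s R2 defaults to the squared Pearson correlation between predictions and observations, which is undefined when the predictions have zero variance. A minimal illustration with made-up numbers:

```r
# cor() returns NA (with a zero-standard-deviation warning) when one vector is
# constant, so the squared-correlation R2 is NA for an intercept-only model.
obs  <- c(0.1, 0.5, 0.9)
pred <- rep(0.35, length(obs))       # constant predictions
suppressWarnings(cor(pred, obs)^2)   # NA
```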

It’s also useful to look at the regression correlation between these component models:

cat("\nRoot-Mean Square Error Correlation\n")
# Calculate model RMSE correlations
modelCor(ensemble.results, metric="RMSE")
cat("\n\nMean Absolute Error Correlation\n")
# Calculate model MAE correlations
modelCor(ensemble.results, metric="MAE")
## 
## Root-Mean Square Error Correlation
##             nnet       knn    ranger    glmnet
## nnet   1.0000000 0.9889090 0.9867333 0.9820596
## knn    0.9889090 1.0000000 0.9918527 0.9957175
## ranger 0.9867333 0.9918527 1.0000000 0.9873565
## glmnet 0.9820596 0.9957175 0.9873565 1.0000000
## 
## 
## Mean Absolute Error Correlation
##             nnet       knn    ranger    glmnet
## nnet   1.0000000 0.9664251 0.9553354 0.9660434
## knn    0.9664251 1.0000000 0.9838072 0.9753176
## ranger 0.9553354 0.9838072 1.0000000 0.9632123
## glmnet 0.9660434 0.9753176 0.9632123 1.0000000

The models’ errors are extremely correlated (all R > 0.95). This is not ideal: stacking is most beneficial when the component models make uncorrelated errors that can offset one another. The error can be visualized with confidence intervals using the dotplot function:

# Plot the four models' RMSE values
p.rmse <- as.ggplot(dotplot(ensemble.results, metric="RMSE"))
# Plot the four models' MAE values
p.mae <- as.ggplot(dotplot(ensemble.results, metric="MAE")) 
# Combine plots side-by-side
grid.arrange(p.rmse, p.mae, ncol=2)

The models all show very similar RMSE confidence intervals. The kNN model shows a lower MAE than the other three models, while the neural network shows the highest. Despite the strong error correlation between the models, I’ll move forward to see if ensembling strengthens the overall predictions for the change in CDR-Sum of Boxes over time.

Model ensemble training and tuning

Generalized linear ensemble

Starting with the basic caretEnsemble function, which by default employs a generalized linear model to combine the component models:

set.seed(127)
# Set trainControl --> 10-fold CV with parallel processing
ensemble.control <- trainControl(method="repeatedcv", number=10, allowParallel = T)

# Create stacked ensemble, optimizing for RMSE
stacked.ensemble.glm <- caretEnsemble(ensemble.models, metric="RMSE", trControl=ensemble.control)
summary(stacked.ensemble.glm)
## The following models were ensembled: nnet, knn, ranger, glmnet 
## They were weighted: 
## 3.3024 -1.0743 -0.307 1.0037 -8.0853
## The resulting RMSE is: 1.0619
## The fit for each individual model on the RMSE is: 
##  method     RMSE    RMSESD
##    nnet 1.138191 0.4385607
##     knn 1.070625 0.4549577
##  ranger 1.075592 0.4061609
##  glmnet 1.072871 0.4254785

All the models show similar RMSE values, but kNN offers the lowest by a slim margin. The caretEnsemble package offers an autoplot function to create in-depth diagnostic plots for ensemble models:

autoplot(stacked.ensemble.glm)

The top-left graph shows the mean cross-validated RMSE per model, with the bars denoting RMSE standard deviation. The red dashed line shows the RMSE of the ensemble model, which is slightly lower than that of any of the individual models. The middle-left plot shows a large negative weight associated with the elastic net model, smaller weights associated with kNN, neural network, and random forest, and a larger positive weight for the intercept.

Stacked ensembles

Stacked ensembles can also be constructed using caretStack, which trains a user-specified meta-model on the component models’ predictions. I’ll try one using a random forest meta-model and one using an elastic net meta-model.
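Conceptually, the stack’s meta-model is fit on the component models’ predictions rather than on the original features. A self-contained sketch of the glm-style stack, using simulated vectors as hypothetical stand-ins for the component predictions:

```r
# Stacking sketch (hypothetical data): the meta-learner sees only the component
# models' predictions of the outcome, and learns how to blend them.
set.seed(127)
y  <- rnorm(100)                 # outcome (stand-in for CDR-SoB change)
p1 <- y + rnorm(100, sd = 0.5)   # hypothetical predictions from model 1
p2 <- y + rnorm(100, sd = 0.8)   # hypothetical predictions from model 2
meta <- lm(y ~ p1 + p2)          # glm-style stack: linear blend of predictions
coef(meta)                       # blending weights, analogous to caretEnsemble's
```

caretStack generalizes this by swapping the linear meta-learner for any caret method (random forest, glmnet, etc.), and by using cross-validated component predictions to avoid leaking training fits into the meta-model.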

set.seed(127)
# Create random forest-combined stacked ensemble
stacked.ensemble.rf <- caretStack(ensemble.models, method = "rf", metric = "RMSE", trControl = ensemble.control)
# Create elastic net-combined stacked ensemble
stacked.ensemble.glmnet <- caretStack(ensemble.models, method="glmnet", metric="RMSE", trControl = ensemble.control)

The summary statistics for each model can be displayed:

cat("\nStacked ensemble, generalized linear model:\n") 
stacked.ensemble.glm 
cat("\n\nStacked ensemble, random forest:\n")
stacked.ensemble.rf
cat("\n\nStacked ensemble, elastic net:\n")
stacked.ensemble.glmnet
## 
## Stacked ensemble, generalized linear model:
## A glm ensemble of 4 base models: nnet, knn, ranger, glmnet
## 
## Ensemble results:
## Generalized Linear Model 
## 
## 246 samples
##   4 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 222, 221, 222, 222, 221, 220, ... 
## Resampling results:
## 
##   RMSE      Rsquared    MAE      
##   1.061885  0.08684754  0.6777372
## 
## 
## 
## Stacked ensemble, random forest:
## A rf ensemble of 4 base models: nnet, knn, ranger, glmnet
## 
## Ensemble results:
## Random Forest 
## 
## 246 samples
##   4 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 222, 221, 222, 222, 221, 220, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE      Rsquared    MAE      
##   2     1.144390  0.06698688  0.6945539
##   3     1.162099  0.06489556  0.7019328
##   4     1.178654  0.06058592  0.7048839
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 2.
## 
## 
## Stacked ensemble, elastic net:
## A glmnet ensemble of 4 base models: nnet, knn, ranger, glmnet
## 
## Ensemble results:
## glmnet 
## 
## 246 samples
##   4 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 220, 222, 221, 221, 222, 221, ... 
## Resampling results across tuning parameters:
## 
##   alpha  lambda        RMSE      Rsquared   MAE      
##   0.10   0.0003874836  1.035647  0.1052673  0.6773436
##   0.10   0.0038748363  1.035568  0.1052222  0.6772353
##   0.10   0.0387483634  1.034216  0.1040302  0.6752533
##   0.55   0.0003874836  1.035676  0.1052599  0.6773358
##   0.55   0.0038748363  1.035463  0.1050010  0.6769625
##   0.55   0.0387483634  1.033049  0.1031535  0.6730375
##   1.00   0.0003874836  1.035682  0.1052608  0.6773399
##   1.00   0.0038748363  1.035382  0.1047341  0.6766830
##   1.00   0.0387483634  1.032724  0.1026793  0.6725872
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.03874836.

The elastic net-stacked ensemble model shows the lowest cross-validated RMSE and MAE of the three ensembles, though this difference may not be significant and may not translate to the out-of-sample data. These three ensemble models (glm-, rf-, and glmnet-combined) will be used to predict annual change in CDR-Sum of Boxes in the same test dataset. Note: since the models were created with center and scale preprocessing specified, the test data does not need to be manually pre-processed.

Model predictions

Training data predictions

# Predict based on training data for the four individual models
# And the three stacked ensemble models
glmnet.train <- predict.train(ensemble.models$glmnet)
knn.train <- predict.train(ensemble.models$knn)
nnet.train <- predict.train(ensemble.models$nnet)
rf.train <- predict(ensemble.models$ranger)
ensemble.glm.train <- predict(stacked.ensemble.glm)
ensemble.glmnet.train <- predict(stacked.ensemble.glmnet)
ensemble.rf.train <- predict(stacked.ensemble.rf)
real.train <- pca.train$CDRSB

# Combine these predictions into a dataframe for easier viewing
train.df <- do.call(cbind, list(elastic.net=glmnet.train, knn=knn.train, neural.net=nnet.train, random.forest=rf.train,
                                ensemble.glm=ensemble.glm.train, ensemble.glmnet=ensemble.glmnet.train, 
                                ensemble.rf=ensemble.rf.train, real.CDR=real.train)) %>% as.data.frame()

datatable(train.df %>% mutate_if(is.numeric, function(x) round(x,4)) %>% select(real.CDR, elastic.net:ensemble.rf))

The elastic net individual model clearly predicts the same value (0.3539) for all training cases, indicating it will not be of much use to predict the CDR Sum of Boxes change.

Test data predictions

# Predict based on UNSEEN test data for the four individual models
# And the three stacked ensemble models
real.test <- pca.test$CDRSB  
glmnet.test <- predict.train(ensemble.models$glmnet, newdata=pca.test)
knn.test <- predict.train(ensemble.models$knn, newdata=pca.test)
nnet.test <- predict.train(ensemble.models$nnet, newdata=pca.test)
rf.test <- predict.train(ensemble.models$ranger, newdata=pca.test)
ensemble.glm.test <- predict(stacked.ensemble.glm, newdata=pca.test)
ensemble.glmnet.test <- predict(stacked.ensemble.glmnet, newdata=pca.test)
ensemble.rf.test <- predict(stacked.ensemble.rf, newdata=pca.test)

# Combine these predictions into a dataframe for easier viewing
test.df <- do.call(cbind, list(real.CDR=real.test, elastic.net=glmnet.test, knn=knn.test, neural.net=nnet.test, random.forest=rf.test,
                               ensemble.glm=ensemble.glm.test, ensemble.glmnet=ensemble.glmnet.test, ensemble.rf=ensemble.rf.test)) %>% as.data.frame()

datatable(test.df %>% mutate_if(is.numeric, function(x) round(x,4)))

Predicted vs. real CDR-SoB comparison

I want to compare these three ensemble models in terms of how their predictions relate to the actual CDR-SoB values in the training and testing data. I also want to compare these results with those obtained with the individual component models to see if constructing the ensemble confers a predictive advantage. First, I will visualize how the predicted values stack up against the actual values for annual change in CDR-Sum of Boxes, in both the training and the test data:

# Create training data ggplot to be converted to interactive plotly plot
p.train <- train.df %>%
  # Reshape to facet on model
  pivot_longer(cols=c(-real.CDR), names_to="Model", values_to="Prediction") %>%
  mutate(Model=factor(Model, levels=c("ensemble.glmnet", "ensemble.glm", "ensemble.rf", "elastic.net", "knn", "neural.net", "random.forest"))) %>%
  ggplot(data=., mapping=aes(x=real.CDR, y= Prediction, color=Model)) +
  geom_point(alpha=0.3) +
  facet_grid(.~Model, scales="free") +
  ggtitle("Model Predictions for CDR Sum of Boxes Annual Change in Training Data") +
  theme_minimal() +
  theme(legend.position="none") +
  ylab("Predicted CDR-SoB Change") +
  xlab("Actual CDR-SoB Change") +
  theme(plot.title=element_text(hjust=0.5))

# Create test data ggplot to be converted to interactive plotly plot
p.test <- test.df  %>%
  # Reshape to facet on model
  pivot_longer(cols=c(-real.CDR), names_to="Model", values_to="Prediction") %>%
  mutate(Model=factor(Model, levels=c("ensemble.glmnet", "ensemble.glm", "ensemble.rf", "elastic.net", "knn", "neural.net", "random.forest"))) %>%
  ggplot(data=., mapping=aes(x=real.CDR, y= Prediction, color=Model)) +
  geom_point(alpha=0.3) +
  facet_grid(.~Model, scales="free") +
  ggtitle("Model Predictions for CDR Sum of Boxes Annual Change in Test Data") +
  theme_minimal() +
  theme(legend.position="none") +
  ylab("Predicted CDR-SoB Change") +
  xlab("Actual CDR-SoB Change") +
  theme(plot.title=element_text(hjust=0.5))

# Use ggplotly to create interactive HTML plots
ggplotly(p.train, height=350, width=900)
ggplotly(p.test, height=350, width=900) 

To quantify the association between real and model-predicted CDR-SoB values, I will use the \(R^2\) and RMSE values computed by the R2 and RMSE functions from caret:

# Calculate the RMSE between real vs. predicted CDR-SoB values for training data
# Combine into dataframe for easier viewing
rmse.train <- data.frame(ensemble.glmnet=RMSE(ensemble.glmnet.train, real.train),
                         ensemble.glm=RMSE(ensemble.glm.train, real.train),
                         ensemble.rf=RMSE(ensemble.rf.train, real.train),
                         elastic.net=RMSE(glmnet.train, real.train),
                         knn=RMSE(knn.train, real.train),
                         neural.net=RMSE(nnet.train, real.train),
                         random.forest=RMSE(rf.train, real.train),
                         Metric="Train_RMSE")
# Calculate the RMSE between real vs. predicted CDR-SoB values for unseen test data
# Combine into dataframe for easier viewing
rmse.test <- data.frame(ensemble.glmnet=RMSE(ensemble.glmnet.test, real.test),
                        ensemble.glm=RMSE(ensemble.glm.test, real.test),
                        ensemble.rf=RMSE(ensemble.rf.test, real.test),
                        elastic.net=RMSE(glmnet.test, real.test),
                        knn=RMSE(knn.test, real.test),
                        neural.net=RMSE(nnet.test, real.test),
                        random.forest=RMSE(rf.test, real.test),
                        Metric="Test_RMSE")
# Calculate the R-squared between real vs. predicted CDR-SoB values for training data
# Combine into dataframe for easier viewing
r2.train <- data.frame(ensemble.glmnet=R2(ensemble.glmnet.train, real.train),
                       ensemble.glm=R2(ensemble.glm.train, real.train),
                       ensemble.rf=R2(ensemble.rf.train, real.train),
                       elastic.net=R2(glmnet.train, real.train),
                       knn=R2(knn.train, real.train),
                       neural.net=R2(nnet.train, real.train),
                       random.forest=R2(rf.train, real.train),
                       Metric="Train_R2")
# Calculate the R-squared between real vs. predicted CDR-SoB values for unseen test data
# Combine into dataframe for easier viewing
r2.test <- data.frame(ensemble.glmnet=R2(ensemble.glmnet.test, real.test),
                      ensemble.glm=R2(ensemble.glm.test, real.test),
                      ensemble.rf=R2(ensemble.rf.test, real.test),
                      elastic.net=R2(glmnet.test, real.test),
                      knn=R2(knn.test, real.test),
                      neural.net=R2(nnet.test, real.test),
                      random.forest=R2(rf.test, real.test),
                      Metric="Test_R2")

# Combine all four prediction dataframes into one table to compare and contrast
# RMSE and R-squared across models
do.call(plyr::rbind.fill, list(rmse.train, rmse.test, r2.train, r2.test)) %>%
  pivot_longer(cols=c(-Metric), names_to="Model", values_to="Value") %>%
  mutate(Metric = str_replace(Metric, "_", " ")) %>%
  pivot_wider(id_cols="Model", names_from="Metric", values_from="Value") %>%
  mutate_if(is.numeric, function(x) round(x,4)) %>%
  kable(., booktabs=T) %>% kable_styling(full_width=F)
Model            Train RMSE  Test RMSE  Train R2  Test R2
ensemble.glmnet      0.8693     0.8442    0.7380   0.1521
ensemble.glm         0.7375     0.8312    0.7863   0.1600
ensemble.rf          0.8609     0.8731    0.5243   0.0665
elastic.net          1.1418     0.8978        NA       NA
knn                  1.0950     0.9041    0.1255   0.0411
neural.net           1.1139     0.9090    0.0485   0.0003
random.forest        0.6511     0.8259    0.8860   0.1752

The random forest model shows the lowest RMSE between predicted vs. real CDR-Sum of Boxes for both the training and test data. It also has the highest \(R^2\) value between predicted and real CDR-Sum of Boxes rate of change. However, the training \(R^2\) of 0.8860 is substantially larger than the test data \(R^2\) of 0.1752, suggesting major overfitting.

In the training dataset, the random forest predictions track the actual values closely, with the glm- and elastic net-stacked ensemble models showing nearly comparable correlations between real and predicted values. However, this relationship is not evident in the test data. Aside from the elastic net individual model, the individual and ensemble models all tend to over-estimate CDR-Sum of Boxes change for test cases with zero actual change. Beyond zero, however, the models tend to under-estimate CDR-Sum of Boxes change, particularly for actual values of ~1.5 and above.
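This over-/under-estimation pattern can be quantified directly from the predictions rather than eyeballed from the scatterplots. A minimal sketch, assuming test.df as constructed above (one prediction column per model plus real.CDR); the range labels are my own groupings:

```r
# Hedged sketch: mean residual (predicted - actual) by range of actual change.
# Positive mean residuals indicate over-estimation; negative, under-estimation.
test.df %>%
  pivot_longer(cols=c(-real.CDR), names_to="Model", values_to="Prediction") %>%
  mutate(Actual_Range = case_when(real.CDR == 0 ~ "Zero change",
                                  real.CDR < 1.5 ~ "Nonzero, < 1.5",
                                  TRUE ~ ">= 1.5")) %>%
  group_by(Model, Actual_Range) %>%
  summarise(Mean_Residual = mean(Prediction - real.CDR), .groups="drop")
```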

# Compile RMSE and R2 results comparing real vs. predicted values for ensembles and component models
overall.ensemble.results <- do.call(plyr::rbind.fill, list(rmse.train, rmse.test, r2.train, r2.test)) %>%
  # Reshape to facet on metric -- i.e. RMSE or R2
  pivot_longer(cols=c(-Metric), names_to="Model", values_to="Value") %>%
  separate(Metric, into=c("Data", "Metric"), sep="_")

p.ensemble.r2.rmse <- overall.ensemble.results %>%
  mutate(Metric = ifelse(Metric=="RMSE", "Real vs. Predicted CDR-SoB RMSE", "Real vs. Predicted CDR-SoB R2")) %>%
  mutate(Data=factor(Data, levels=c("Train", "Test")),
         Model=factor(Model, levels=c("ensemble.glmnet", "ensemble.glm", "ensemble.rf", "elastic.net", "knn", "neural.net", "random.forest"))) %>%
  ggplot(data=., mapping=aes(x=Data, y=Value, color=Model, group=Model)) +
  geom_point() +
  geom_line() +
  theme_minimal() +
  facet_wrap(Metric~., scales="free", nrow=1) +
  theme(strip.text=element_text(size=12, face="bold"),
        axis.title=element_blank())

# Convert to interactive plotly plot, rename x/y axis titles
ggplotly(p.ensemble.r2.rmse) %>% 
  layout(yaxis = list(title = "R2", 
                      titlefont = list(size = 12)),
         xaxis = list(title = "Data Subset", 
                      titlefont = list(size = 12)),
         yaxis2 = list(title = "RMSE", 
                       titlefont = list(size = 12)),
         xaxis2 = list(title = "Data Subset", 
                       titlefont = list(size = 12)),
         autosize = F, width = 900, height = 400)

There are two clusters of model performance here: the random forest, elastic net-stacked ensemble, and glm-stacked ensemble models show high \(R^2\) in the training data, and all decrease to a very similar value (0.15-0.17) in the test dataset. All three also show the lowest RMSE. By contrast, the neural network, kNN, and random forest-stacked ensemble models show low \(R^2\) and high RMSE; interestingly, their error is higher in the training dataset than in the test dataset.
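One simple way to summarize this overfitting pattern is the train-to-test \(R^2\) drop per model, sketched here from the overall.ensemble.results dataframe assembled above:

```r
# Hedged sketch: train-minus-test R^2 per model; a larger drop indicates
# greater overfitting to the training data
overall.ensemble.results %>%
  filter(Metric == "R2") %>%
  pivot_wider(id_cols="Model", names_from="Data", values_from="Value") %>%
  mutate(R2_Drop = Train - Test) %>%
  arrange(desc(R2_Drop))
```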

Re-name objects to have names specific to the PCA-transformed dataset:

# Rename relevant objects to have PCA-transformed data-specific names
models.for.ensemble.PCA <- ensemble.models
model.metrics.PCA <- overall.ensemble.results
stacked.ensemble.glm.PCA <- stacked.ensemble.glm
stacked.ensemble.glmnet.PCA <- stacked.ensemble.glmnet
stacked.ensemble.rf.PCA <- stacked.ensemble.rf

Phase Five: Model Evaluation

I’ll evaluate model performance across the two ROI aggregation types:

  • Original data
  • PCA-transformed data

Here, I will define “model performance” as the RMSE and R\(^2\) for agreement between predicted and actual CDR-Sum of Boxes change per year in the test dataset. Seven models are evaluated: four individual models and three stacked ensembles.

model.metrics.OG$ROI_Type <- "Original"
model.metrics.PCA$ROI_Type <- "PCA"

# Concatenate the four dataframes into one dataframe
model.metrics.full <- do.call(plyr::rbind.fill, list(model.metrics.OG,
                                                     model.metrics.PCA))

Predicted vs. real CDR-SoB comparison

I’ll use bar plots to visualize the RMSE and R\(^2\) metrics for each of the seven models:

# Define two distinct color palettes using hex colors
rmse.pal <- c("#C0D3D9", "#7DBCDD", "#2D69A5", "#2A486C")
r2.pal <- c("#fcb9b2", "#f67e7d", "#843b62", "#621940")

# Create static ggplot
p.rmse <- model.metrics.full %>%
  # Plot the RMSE for test data predictions
  filter(Data=="Test", Metric=="RMSE") %>%
  # Show models in specific order on x-axis
  mutate(Model=factor(Model, levels=c("ensemble.glmnet", "ensemble.glm", "ensemble.rf", "elastic.net", "knn", "neural.net", "random.forest"))) %>%
  ggplot(data=., mapping=aes(x=Model, y=Value, fill=ROI_Type)) +
  geom_bar(stat="identity", position=position_dodge()) +
  labs(fill="ROI Type") +
  theme_minimal() +
  ggtitle("RMSE Between Predicted vs. Actual CDR-SoB") +
  # Use pre-defined color palette
  scale_fill_manual(values=rmse.pal) +
  ylab("RMSE") +
  theme(axis.text.x = element_text(angle=45, hjust=1),
        plot.title=element_text(hjust=0.5),
        strip.text = element_text(size=12, face="bold"),
        legend.position = "bottom")

# Convert to plotly interactive visualization
ggplotly(p.rmse, width=700, height=300)
p.r2 <- model.metrics.full %>%
  # Plot the R2 for test data predictions
  filter(Data=="Test", Metric=="R2") %>%
  # Show models in specific order on x-axis
  mutate(Model=factor(Model, levels=c("ensemble.glmnet", "ensemble.glm", "ensemble.rf", "elastic.net", "knn", "neural.net", "random.forest"))) %>%
  ggplot(data=., mapping=aes(x=Model, y=Value, fill=ROI_Type)) +
  geom_bar(stat="identity", position=position_dodge()) +
  labs(fill="ROI Type") +
  theme_minimal() +
  ggtitle("R2 Between Predicted vs. Actual CDR-SoB") +
  # Use pre-defined color palette
  scale_fill_manual(values=r2.pal) +
  ylab("R2") +
  theme(axis.text.x = element_text(angle=45, hjust=1),
        plot.title=element_text(hjust=0.5),
        strip.text = element_text(size=12, face="bold"),
        legend.position = "bottom")
ggplotly(p.r2, width=700, height=300)

The RMSE doesn’t afford much distinction either by ROI averaging type or by predictive model. If anything, the RMSE stands out as somewhat larger for the original ROI configuration in the random forest-stacked ensemble (ensemble.rf) and the neural network (neural.net). However, the R\(^2\) agreement between predicted and actual CDR-SoB does differentiate among the models. The random forest model (random.forest) shows the highest R\(^2\) value for each of the two ROI configurations, surpassing all three of the stacked ensemble models. The PCA-transformed data shows the larger R\(^2\) agreement of the two ROI configurations, with the original data configuration close behind; however, even these values are <0.2, suggesting minimal association between the predictions and the actual values.

Interpretation

At the end of the day, none of the models and corresponding ROI configurations offered strong predictions for CDR-Sum of Boxes annual change based on regional changes in tau-PET tracer uptake. To improve this model in the future, I would consider including the baseline tau-PET SUVR value in a mixed-effects model; conceivably, an increase of 0.5 SUVR could have different implications for a subject with no tau-PET uptake at baseline versus a subject with high tau-PET uptake at baseline. It’s also entirely possible that there is a temporal lag between tau-PET uptake and cognitive decline; for example, tau accumulation in a given ROI (e.g. the hippocampus) may precede cognitive decline by months or years, which would not be detected in this model. This is a complex pathophysiological dynamic that necessitates sophisticated modeling supported by extensive domain knowledge.
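As a sketch of what that mixed-effects extension might look like, assuming the lme4 package and a hypothetical long-format dataframe long.df with per-visit SUVR_Change, SUVR_Baseline, and Subject_ID columns (none of which exist in this project as-is; the simulated data below is purely illustrative):

```r
# Hedged sketch only: long.df and its columns are hypothetical stand-ins
library(lme4)

set.seed(1)
long.df <- data.frame(Subject_ID = factor(rep(1:20, each=3)),
                      SUVR_Change = rnorm(60, 0.1, 0.05),
                      SUVR_Baseline = rep(runif(20, 1, 2), each=3),
                      Age_Baseline = rep(rnorm(20, 72, 6), each=3))
long.df$CDRSB <- with(long.df, 2*SUVR_Change*SUVR_Baseline + rnorm(60, 0, 0.2))

# Baseline tau burden enters as a fixed effect interacting with SUVR change,
# with a random intercept per subject to account for repeated measures
mixed.fit <- lmer(CDRSB ~ SUVR_Change * SUVR_Baseline + Age_Baseline +
                    (1 | Subject_ID), data = long.df)
summary(mixed.fit)
```

The interaction term is what would let a 0.5 SUVR increase carry different weight depending on baseline tau burden.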

That being said, I am interested in a yet-unexamined facet of this project: how does a given region of interest influence a given model? How important is e.g. the hippocampus relative to e.g. the insula?

Variable contributions

To answer these questions, I will use the varImp function from caret, which computes the relative importance of each variable used as model input. Since variable importance is not readily extractable from caretStack ensembles, I’ll focus exclusively on the individual models. Variable importance is also not meaningful for k-Nearest Neighbors (kNN), so I will examine the random forest, elastic net, and neural network regression models. For elastic net regression, I will also extract the ROI coefficients.

First, I’ll examine the original ROI conformation:

# Extract variable importance from random forest regression model
rf.var.imp <- varImp(models.for.ensemble.OG$ranger)[[1]] %>%
  rownames_to_column(var="Term") %>%
  dplyr::rename("Importance"="Overall") %>%
  mutate(Model="Random Forest")

# Extract variable importance from elastic net regression model
glmnet.var.imp <- varImp(models.for.ensemble.OG$glmnet)[[1]] %>%
  rownames_to_column(var="Term") %>%
  dplyr::rename("Importance"="Overall") %>%
  mutate(Model="Elastic Net")

# Extract variable coefficients from optimal elastic net regression model
glmnet.var.coef <- as.data.frame(as.matrix(coef(models.for.ensemble.OG$glmnet$finalModel, models.for.ensemble.OG$glmnet$bestTune$lambda))) %>%
  rownames_to_column(var="Term") %>% dplyr::rename("Coefficient"="1") %>%
  mutate(Model="Elastic Net")

# Combine importance + coefficients for elastic net regression dataframe
glmnet.var <- left_join(glmnet.var.imp, glmnet.var.coef)

# Extract variable importance from neural network regression model
nnet.var.imp <- varImp(models.for.ensemble.OG$nnet)[[1]] %>%
  rownames_to_column(var="Term") %>%
  dplyr::rename("Importance"="Overall") %>%
  mutate(Model="Neural Network")

# Concatenate these three dataframes into one dataframe
og.importance <- do.call(plyr::rbind.fill, list(rf.var.imp, glmnet.var, nnet.var.imp))
datatable(og.importance)
# Pre-define color palette
my.pal <- colorRampPalette(c("#C0D3D9", "#7DBCDD", "#2D69A5", "#2A486C"))(200)

# Plot the variable importance for each model
p.importance <- og.importance %>%
  ggplot(data=., mapping=aes(x=Term, y=Importance, fill=Importance)) +
  geom_bar(stat="identity") +
  # Facet by model, one model per row
  facet_wrap(Model ~ ., scales="free_y", nrow=3) +
  theme_minimal() +
  scale_fill_gradientn(colors=my.pal) +
  # Rotate x-axis text for legibility
  theme(axis.text.x = element_text(angle=90, hjust=1),
        axis.title=element_blank(),
        strip.text = element_text(size=14))

# Convert to interactive plotly visualization
ggplotly(p.importance, width=800, height=700) %>%
  layout(xaxis = list(title = "Model Term", 
                      titlefont = list(size = 14)), 
         yaxis=list(title="Importance", 
                    titlefont = list(size = 12)),
         yaxis2 = list(title="Importance", 
                       titlefont = list(size = 12)),
         yaxis3 = list(title="Importance", 
                       titlefont = list(size = 12)))

It’s really interesting how differently these three models weigh each model term. The elastic net model placed some of the largest importance in the bankssts, entorhinal, hippocampus, and inferiortemporal ROIs, which typically show some of the heaviest tau pathology burdens in the human Alzheimer’s Disease brain. The neural network found sex to be the most important predictive variable by far, which is surprising and may contribute to the relatively low predictive accuracy associated with the neural network. The random forest identified the inferiorparietal, isthmuscingulate, insula, and parahippocampal ROIs as the most important predictor terms.

Variable coefficients

I’m curious as to how variable importance relates to the regularized coefficient calculated for each predictor term in the elastic net model:

# Pre-define color palette
my.pal <- colorRampPalette(c("#fcb9b2", "#f67e7d", "#843b62", "#621940"))(200)

# Plot variable coefficients and fill with variable importance
p.enet <- og.importance %>%
  filter(Model=="Elastic Net") %>%
  # Rearrange term by coefficient using forcats::fct_reorder
  ggplot(data=., mapping=aes(x=fct_reorder(Term, Coefficient), y=Coefficient, 
                             fill=Importance, label=Term)) +
  geom_bar(stat="identity") +
  xlab("Term") +
  theme_minimal() +
  scale_fill_gradientn(colors=my.pal) +
  ggtitle("Variable Coefficients & Importance in Elastic Net Regression") +
  theme(axis.text.x=element_text(angle=90),
        plot.title=element_text(hjust=0.5))

# tooltip specifies only to show the label (term), y (coefficient), and fill (importance) upon cursor hover
ggplotly(p.enet, tooltip=c("label", "y", "fill"), width=700, height=400)

The most important features bookend the x-axis and also have the largest-magnitude coefficients. It’s surprising that the entorhinal cortex and hippocampus have two of the most negative coefficients of all the variables; a negative coefficient means that the rate of tau accumulation in that ROI is negatively associated with CDR-Sum of Boxes increase. The entorhinal cortex and hippocampus are two of the first gray matter regions to develop tau pathology in Alzheimer’s Disease according to the Braak staging paradigm, so I would expect increased tau accumulation in these regions to be associated with increased CDR-Sum of Boxes scores, not decreased (note: a higher CDR-Sum of Boxes score means greater cognitive impairment). One possible interpretation relates to the temporal gap between tau neurofibrillary tangle accumulation and cognitive decline described above, though further assessment would be warranted to interpret the implications of these negative coefficients.
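A quick way to probe whether these negative coefficients reflect suppression by correlated ROIs, rather than a genuinely negative marginal association, is to check the unadjusted correlation for a single ROI. A sketch, assuming original.train contains the entorhinal SUVR-change column under the same name as the model term:

```r
# Hedged sketch: marginal Pearson correlation between entorhinal tau-PET SUVR
# change and CDR-SoB change, ignoring all other predictors
cor(original.train$entorhinal, original.train$CDRSB, method="pearson")
```

If this marginal correlation is positive while the regularized coefficient is negative, the sign flip is likely driven by collinearity among neighboring ROIs rather than any protective effect of entorhinal tau.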

Regional importance and coefficients

I’ll use the ggseg package to visualize these elastic net regression coefficients and variable importance metrics in anatomical brain space:

# Pre-defined color palette
my.pal <- colorRampPalette(c("#fcb9b2", "#f67e7d", "#843b62", "#621940"))(200)

# Plot ROI importance in brain using ggseg
p.roi.imp <- og.importance %>%
  filter(Model=="Elastic Net") %>%
  # Remove the non-brain ROI terms
  filter(!(Term %in% c("Sex_Male", "Age_Baseline"))) %>%
  left_join(., ggseg.aparc, by=c("Term"="tau_ROI")) %>%
  rename("region"="ggseg_ROI") %>%
  filter(!is.na(region)) %>%
  # Convert to brain representation using Desikan-Killiany (dk) atlas
  ggseg(atlas="dk", mapping=aes(fill=Importance, label=region)) +
  ggtitle("Elastic Net ROI Importance") +
  # Label left and right hemispheres
  annotate(geom="text", x=410, y=-100, label="Left") +
  annotate(geom="text", x=1290, y=-100, label="Right") +
  scale_fill_gradientn(colors=my.pal) +
  # Use calibri font
  theme(axis.text=element_blank(),
        axis.title.x = element_text(family="calibri"),
        legend.title=element_text(family="calibri"),
        legend.text=element_text(family="calibri"),
        plot.title=element_text(hjust=0.5, family="calibri"),
        text=element_text(family="calibri"))

# Convert to interactive plotly visualization
ggplotly(p.roi.imp, dynamicTicks = T, tooltip=c("label", "fill"), width=700, height=300)

One interesting observation is that the ROIs with the largest relative importance are located at the lateral edges of the cortex rather than the medial edge, with the exception of the entorhinal and pericalcarine cortices.

The elastic net coefficient weights can also be viewed in the brain:

# Plot ROI coefficient in brain
p.roi.coef <- og.importance %>%
  filter(Model=="Elastic Net") %>%
  # Remove the non-brain ROI terms
  filter(!(Term %in% c("Sex_Male", "Age_Baseline"))) %>%
  left_join(., ggseg.aparc, by=c("Term"="tau_ROI")) %>%
  rename("region"="ggseg_ROI") %>%
  filter(!is.na(region)) %>%
  # Convert to brain representation using Desikan-Killiany (dk) atlas
  ggseg(atlas="dk", mapping=aes(fill=Coefficient, label=region)) +
  ggtitle("Elastic Net ROI Coefficients") +
  # Label left and right hemispheres
  annotate(geom="text", x=410, y=-100, label="Left") +
  annotate(geom="text", x=1290, y=-100, label="Right") +
  scale_fill_continuous_divergingx(palette = 'RdBu', rev=T, mid = 0) +
  # Use calibri font
  theme(axis.text=element_blank(),
        axis.title.x = element_text(family="calibri"),
        legend.title=element_text(family="calibri"),
        legend.text=element_text(family="calibri"),
        plot.title=element_text(hjust=0.5, family="calibri"),
        text=element_text(family="calibri"))

# Convert to interactive plotly visualization
ggplotly(p.roi.coef, dynamicTicks = T, tooltip=c("label", "fill"), width=700, height=300)

Phase Six: Deployment

Data reshaping

Combine original train + original test into full df:

original.full <- plyr::rbind.fill(original.train, original.test)

Pivot the original ROI configuration dataset from wide to long (preparation for correlation matrix), omitting age at baseline and sex:

# Pivot wide --> long
original.roi.long <- original.full %>%
  select(-Age_Baseline, -Sex_Male) %>%
  pivot_longer(cols=c(-CDRSB), names_to="ROI", values_to="SUVR_Change")

Pairwise correlations

Calculate pairwise ROI tau-PET SUVR correlations using rcorr function from the Hmisc package:

# Convert tau-PET regional SUVR change to a matrix
original.roi.mat <- original.full %>%
  select(-Age_Baseline, -Sex_Male, -CDRSB) %>%
  as.matrix()

# Use rcorr from Hmisc to calculate Pearson correlation coefficients ($r)
original.roi.corr <- rcorr(original.roi.mat)
original.roi.corr.coef <- original.roi.corr$r

For visualization, it’s easier to then convert this correlation matrix from wide to long. I’ll utilize the pivot_longer function from the tidyr package to reshape the correlation coefficient dataframe.

# Correlation coefficient (Pearson)
original.roi.corr.coef.long <- as.data.frame(original.roi.corr.coef) %>%
  # Cast rownames as a dataframe column
  rownames_to_column(var="ROI1") %>%
  # Pivot data wide --> long
  pivot_longer(cols=c(-ROI1), names_to="ROI2", values_to="Pearson_Corr") %>%
  # Omit correlations between the same ROI, which are always equal to 1
  filter(ROI1 != ROI2) %>%
  # Omit instances where the two ROIs are the same, just in different columns
  mutate(ROI12 = ifelse(ROI1<ROI2, paste(ROI1, ROI2, sep="_"),
                        paste(ROI2, ROI1, sep="_"))) %>%
  distinct(ROI12, .keep_all=T) %>%
  select(-ROI12)


# Save the final long-format dataframe
original.roi.corr.results <- original.roi.corr.coef.long

igraph preparation

# Edges are defined as cortical lobe --> specific ROI connection
edges <- read.csv("https://raw.githubusercontent.com/anniegbryant/DA5030_Final_Project/master/6_Deployment/tau_roi_nodes.csv") %>% distinct()
# ROIs don't include the origin --> cortical lobe connection
rois <- edges %>% filter(!(to %in% c("Cingulate", "Frontal", "Insula",
                                     "Occipital", "Parietal", "Temporal")))

# Create a dataframe of vertices, one line per object in the ROI cortical lobe hierarchy
vertices = data.frame(name = unique(c(as.character(edges$from), as.character(edges$to))))
vertices$group <- edges$from[match(vertices$name, edges$to)]

# Create an igraph object
mygraph <- graph_from_data_frame(edges, vertices=vertices)
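As a quick structural check of the origin → lobe → ROI hierarchy (assuming the ggraph package, which is not loaded elsewhere in this script), the graph can be drawn as a circular dendrogram:

```r
# Hedged sketch: visualize the lobe-to-ROI hierarchy stored in mygraph
library(ggraph)

ggraph(mygraph, layout="dendrogram", circular=TRUE) +
  geom_edge_diagonal() +
  # Label only leaf nodes, i.e. the individual ROIs
  geom_node_text(aes(label=name, filter=leaf), size=2) +
  theme_void()
```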

Save data

Save this data to an .RData file to be loaded into the Shiny app:

save(original.roi.corr.coef, og.importance, original.roi.corr.results, ggseg.aparc, ggseg.aseg, rois, vertices, mygraph, file="RData/shiny_data.RData")
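On the Shiny side, load() restores all eight saved objects into the app environment under their original names; a minimal sketch of the top of a hypothetical app.R:

```r
# Hedged sketch: in the Shiny app script, restore the saved objects so that
# e.g. mygraph, og.importance, and original.roi.corr.results are available
# to the UI and server functions by name
load("RData/shiny_data.RData")
ls()
```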